IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 



In re Application of: Injong Rhee 



Group Art Unit: 2667 



Serial No.: 09/989,957



Examiner: Grey, Christopher P. 



Filed: November 21, 2001



Docket No. 297/123/2 



Confirmation No.: 1668 

For: METHODS AND SYSTEMS FOR RATE-BASED FLOW CONTROL BETWEEN A 
SENDER AND A RECEIVER 



SUPPLEMENTAL INFORMATION DISCLOSURE STATEMENT 

Mail Stop RCE 
Commissioner for Patents 
P.O. Box 1450 
Alexandria, VA 22313-1450 



In accordance with 37 C.F.R. 1.56, 1.97, and 1.98, applicants' undersigned attorney
brings to the attention of the Patent and Trademark Office the documents listed on the 
attached Form PTO-1449. Copies of the references as well as Form PTO-1449 are 
attached hereto. This is not to be construed as a representation that a search has been 
made or that a reference is relevant merely because cited. 

Early passage of the subject application to issue is earnestly solicited. 



********** 



Sir: 



Serial No.: 09/989,957 

A check in the amount of $395.00 is enclosed for the Request for Continued 
Examination fee. However, the Commissioner is authorized to charge any deficiencies of 
payment or credit any overpayments associated with the filing of this correspondence to 
Deposit Account No. 50-0426.



Respectfully submitted, 



JENKINS, WILSON, TAYLOR & HUNT, P.A. 



Date: June 23, 2006



By: 



Gregory A. Hunt
Registration No. 41,085 
Customer No. 25297 




297/123/2 GAH/sed 



Enclosures 




Page 1 of 2 



FORM PTO-1449 U.S. Department of Commerce
Patent and Trademark Office

List of Documents Cited by Applicant


Application No.:


09/989,957


Filing Date:


November 21, 2001


First Named Inventor: 


Injong Rhee 


Group: 


2667 


Examiner: 


Grey, Christopher P. 




Attorney Docket No.: 


297/123/2 



U.S. PATENT DOCUMENTS 



Examiner 
Initial 


Cite 
No. 


Document Number 


Publication Date 


Name of Patentee or Applicant of Cited 
Document 


Pages, Columns, Lines, 
where relevant passages or 
relevant figures appear 



























FOREIGN PATENT DOCUMENTS 



Examiner 
Initials 



Cite 
No. 



Document Number 

(country code, no., kind 
code (if known) 



Publication Date 



Name of Patentee or Applicant 



Pages, columns, lines 
where relevant 
passages appear 



OTHER DOCUMENTS 



Examiner 
Initials 


Cite 
No. 


Include Author (in CAPITAL LETTERS), Title, Journal, Date, Pertinent Pages, Etc. 






1 


Whetten et al., "Reliable Multicast Transport Building Blocks for One-to-Many Bulk-Data 
Transfer," IETF Internet-Draft, http://www.ietf.org/rfc/rfc3048.txt, pgs. 1-19 (January 2001). 






2 


Lee et al., "An Application Level Multicast Architecture for Multimedia Communications in 
the Internet," IRTF Reliable Multicast Research Group (November 1999).






3 


Ramesh et al., "Issues in Model-Based Flow Control," IRTF Reliable Multicast Research
Group, pgs. 1-14 (November 1999). 






4 


Luby et al., "Heterogeneous Multicast Congestion Control Based on Router Packet 
Filtering," IRTF Reliable Multicast Research Group, pgs. 1-13 (May 31, 1999).






5 


Bhattacharyya et al., "The Loss Path Multiplicity Problem for Multicast Congestion Control," 
In Proceedings of IEEE INFOCOM, pgs. 856-863 (1999). 






6 


Padhye et al., "A Model Based TCP-Friendly Rate Control Protocol," In Proceedings of the 
Ninth International Workshop on Network and Operating Systems Support for Digital Audio 
and Video (1999). 





Page 2 of 2 





7 


Tuan et al., "Multiple Time Scale Redundancy Control for QoS-Sensitive Transport of Real- 
Time Traffic," In Proceedings of INFOCOM, pgs. 1683-1692 (2000). 






8 


Li et al., "HPF: A Transport Protocol for Supporting Heterogeneous Packet Flows in the 
Internet," Research Paper, Coordinated Sciences Laboratory, University of Illinois at 
Urbana-Champaign, pgs. 543-550 (1998). 






9 


Turletti et al., "Experiments with a Layered Transmission Scheme Over the Internet," 
Technical Report RR-3296, pgs. 1-26 (November 1997).






10 


Mathis et al., "The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm," 
ACM Computer Communication Review, 27(3), pgs. 67-82 (July 1997). 






11 


Floyd, "Connections with Multiple Congested Gateways in Packet-Switched Networks Part 
2: Two-Way Traffic," Unpublished draft, pgs. 1-8 (December 1991). 




EXAMINER 


DATE CONSIDERED 



*Examiner: Initial if reference considered, whether or not citation is in conformance with MPEP 609; draw line through
citation if not in conformance and not considered. Include copy of this form with next communication to applicant. 







Network Working Group                                         B. Whetten
Request for Comments: 3048                                      Talarian
Category: Informational                                      L. Vicisano
                                                                   Cisco
                                                              R. Kermode
                                                                Motorola
                                                              M. Handley
                                                                   ACIRI
                                                                S. Floyd
                                                                   ACIRI
                                                                 M. Luby
                                                        Digital Fountain
                                                            January 2001



Reliable Multicast Transport Building Blocks for One-to-Many
Bulk-Data Transfer

Status of this Memo 

This memo provides information for the Internet community. It does 
not specify an Internet standard of any kind. Distribution of this 
memo is unlimited. 

Copyright Notice 

Copyright (C) The Internet Society (2001). All Rights Reserved.



This document describes a framework for the standardization of bulk- 
data reliable multicast transport. It builds upon the experience 
gained during the deployment of several classes of contemporary 
reliable multicast transport, and attempts to pull out the 
commonalities between these classes of protocols into a number of 
building blocks. To that end, this document recommends that certain 
components that are common to multiple protocol classes be 
standardized as "building blocks". The remaining parts of the 
protocols, consisting of highly protocol specific, tightly 
intertwined functions, shall be designated as "protocol cores". 
Thus, each protocol can then be constructed by merging a "protocol 
core" with a number of "building blocks" which can be re-used across 
multiple protocols. 

1. Introduction

RFC 2357 lays out the requirements for reliable multicast protocols 
that are to be considered for standardization by the IETF. They 
include:

o Congestion Control. The protocol must be safe to deploy in the 
widespread Internet. Specifically, it must adhere to three 
mandates: a) it must achieve good throughput (i.e., it must not 
consistently overload links with excess data or repair traffic),
b) it must achieve good link utilization, and c) it must not 
starve competing flows. 

o Scalability. The protocol should be able to work under a variety 
of conditions that include multiple network topologies, link 
speeds, and the receiver set size. It is more important to have a 
good understanding of how and when a protocol breaks than when it 
works.

o Security. The protocol must be analyzed to show what is necessary
to allow it to cope with security and privacy issues. This 
includes understanding the role of the protocol in data 
confidentiality and sender authentication, as well as how the 
protocol will provide defenses against denial of service attacks.

These requirements are primarily directed towards making sure that 
any standards will be safe for widespread Internet deployment. The 
advancing maturity of current work on reliable multicast congestion 
control (RMCC) [HFW99] in the IRTF Reliable Multicast Research Group 
(RMRG) has been one of the events that has allowed the IETF to 
charter the RMT working group. RMCC only addresses a subset of the 
design space for reliable multicast. Fortuitously, the requirements 
it addresses are also the most pressing application and market 
requirements.

A protocol's ability to meet the requirements of congestion control, 
scalability, and security is affected by a number of secondary 
requirements that are described in a separate document [RFC2887]. In
summary, these are: 

o Ordering Guarantees. A protocol must offer at least one of either 
source ordered or unordered delivery guarantees. Support for 
total ordering across multiple senders is not recommended, as it 
makes it more difficult to scale the protocol, and can more easily 
be implemented at a higher level. 

o Receiver Scalability. A protocol should be able to support a 
"large" number of simultaneous receivers per transport group. A 
typical receiver set could be on the order of at least 1,000 - 
10,000 simultaneous receivers per group, or could even eventually 
scale up to millions of receivers in the large Internet. 

o Real-Time Feedback. Some versions of RMCC may require soft real- 
time feedback, so a protocol may provide some means for this 
information to be measured and returned to the sender. While this 
does not require that a protocol deliver data in soft real-time, 
it is an important application requirement that can be provided 
easily given real-time feedback. 

o Delivery Guarantees. In many applications, a logically defined 
unit or units of data is to be delivered to multiple clients, 
e.g., a file or a set of files, a software package, a stock quote 
or package of stock quotes, an event notification, a set of 
slides, a frame or block from a video. An application data unit 
is defined to be a logically separable unit of data that is useful 
to the application. In some cases, an application data unit may 
be short enough to fit into a single packet (e.g., an event
notification or a stock quote), whereas in other cases an
application data unit may be much longer than a packet (e.g., a
software package). A protocol must provide good throughput of
application data units to receivers. This means that most data
that is delivered to receivers is useful in recovering the
application data unit that they are trying to receive. A protocol
may optionally provide delivery confirmation, i.e., a mechanism
for receivers to inform the sender when data has been delivered.
There are two types of confirmation, at the application data unit
level and at the packet level. Application data unit confirmation 
is useful at the application level, e.g., to inform the 
application about receiver progress and to decide when to stop 
sending packets about a particular application data unit. Packet 
confirmation is useful at the transport level, e.g., to inform the 
transport level when it can release buffer space being used for 
storing packets for which delivery has been confirmed. Packet 
level confirmation may also aid in application data unit 
confirmation. 

o Network Topologies. A protocol must not break the network when 
deployed in the full Internet. However, we recognize that 
intranets will be where the first wave of deployments happen, and 
so are also very important to support. Thus, support for 
satellite networks (including those with terrestrial return paths
or no return paths at all) is encouraged, but not required. 

o Group Membership. The group membership algorithms must be
scalable. Membership can be anonymous (where the sender does not
know the list of receivers), or fully distributed (where the
sender receives a count of the number of receivers, and optionally
a list of failures).

o Example Applications. Some of the applications that an RM protocol
could be designed to support include multimedia broadcasts, real
time financial market data distribution, multicast file transfer, 
and server replication. 
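
The two confirmation levels described under Delivery Guarantees can be illustrated with a minimal sketch. The class and method names below (`SendBuffer`, `confirm_packet`, `adu_confirmed`) are hypothetical and are not taken from any RMT specification, and a single receiver is assumed for brevity. Packet-level confirmation releases buffer space; application data unit confirmation reports when the whole unit has been delivered.

```python
class SendBuffer:
    """Hypothetical sender-side buffer for one application data unit (ADU)."""

    def __init__(self, adu: bytes, packet_size: int) -> None:
        # Split the ADU into packets keyed by sequence number.
        self.packets = {
            seq: adu[off:off + packet_size]
            for seq, off in enumerate(range(0, len(adu), packet_size))
        }
        self.total = len(self.packets)
        self.confirmed = set()

    def confirm_packet(self, seq: int) -> None:
        """Packet-level confirmation: release the buffer space held for seq."""
        if seq in self.packets:
            del self.packets[seq]
            self.confirmed.add(seq)

    def adu_confirmed(self) -> bool:
        """ADU-level confirmation: true once every packet has been confirmed."""
        return len(self.confirmed) == self.total


buf = SendBuffer(b"x" * 2500, packet_size=1000)  # three packets
buf.confirm_packet(0)
buf.confirm_packet(2)
print(len(buf.packets), buf.adu_confirmed())     # one packet still buffered
buf.confirm_packet(1)
print(len(buf.packets), buf.adu_confirmed())     # buffer empty, ADU confirmed
```

With many receivers, a real protocol would need to aggregate confirmations before releasing a packet; that aggregation is outside the scope of this sketch.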

In the rest of this document the following terms will be used with a 
specific connotation: "protocol family", "protocol component",
"building block", "protocol core", and "protocol instantiation". A 
"protocol family" is a broad class of RM protocols which share a 
common characteristic. In our classification, this characteristic is 
the mechanism used to achieve reliability. A "protocol component" is 
a logical part of the protocol that addresses a particular 
functionality. A "building block" is a constituent of a protocol 
that implements one, more than one or a part of a component. A 
"protocol core" is the set of functionality required for the 
instantiation of a complete protocol, that is not specified by any
building block. Finally a "protocol instantiation" is an actual RM
protocol defined in terms of building blocks and a protocol core.
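
The relationship between these terms can be sketched as a small data model. This is only an illustration of the vocabulary: the class names, field names, and example values are hypothetical, and the memo itself defines no such data structures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class BuildingBlock:
    """A reusable constituent that implements one or more components."""
    name: str
    components: frozenset  # components implemented fully or in part


@dataclass
class ProtocolInstantiation:
    """An actual RM protocol: one protocol core plus building blocks."""
    family: str                 # protocol family, e.g. "ALC"
    core: str                   # label for the protocol-specific core
    blocks: list = field(default_factory=list)

    def covered_components(self) -> set:
        """Components addressed by at least one building block."""
        covered = set()
        for block in self.blocks:
            covered |= block.components
        return covered


fec = BuildingBlock("FEC coding", frozenset({"Loss Protection", "Loss Recovery"}))
cc = BuildingBlock("Congestion control",
                   frozenset({"Congestion Feedback", "Rate Regulation"}))
proto = ProtocolInstantiation(family="ALC", core="example-core", blocks=[fec, cc])
print(sorted(proto.covered_components()))
```

Anything the protocol needs that no block covers would, by the definition above, belong to the protocol core.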

1.1. Protocol Families 

The design-space document [RFC2887] also provides a taxonomy of the 
most popular approaches that have been proposed over the last ten 
years. After congestion control, the primary challenge has been that 
of meeting the requirement for ensuring good throughput in a way that 
scales to a large number of receivers. For protocols that include a 
back-channel for recovery of lost packets, the ability to take 
advantage of support of elements in the network has been found to be
very beneficial for supporting good throughput for large numbers of
receivers. Other protocols have found it very beneficial to transmit 
coded data to achieve good throughput for large numbers of receivers. 

This taxonomy breaks proposed protocols into four families. Some 
protocols in the family provide packet level delivery confirmation 
that may be useful to the transport level. All protocols in all 
families can be supplemented with higher level protocols that provide 
delivery confirmation of application data units. 

1 NACK only. Protocols such as SRM [FJM95] and MDP2 [MA99] attempt 
to limit traffic by only using NACKs for requesting packet 
retransmission. They do not require network infrastructure. 

2 Tree based ACK. Protocols such as RMTP [LP96, PSLM97], RMTP-II
[WBPM98] and TRAM [KCW98], use positive acknowledgments (ACKs).
ACK based protocols reduce the need for supplementary protocols
that provide delivery confirmation, as the ACKs can be used for
this purpose. In order to avoid ACK implosion in scaled up 
deployments, the protocol can use servers placed in the network. 

3 Asynchronous Layered Coding (ALC). These protocols (examples
include [RV97] and [BLMR98]) use sender-based Forward Error
Correction (FEC) methods with no feedback from receivers or the
network to ensure good throughput. These protocols also use
sender-based layered multicast and receiver-driven protocols to 
join and leave these layers with no feedback to the sender to 
achieve scalable congestion control. 

4 Router assist. Like SRM, protocols such as PGM [FLST98] and
[LG97] also use negative acknowledgments for packet recovery. 
These protocols take advantage of new router software to do 
constrained negative acknowledgments and retransmissions. Router 
assist protocols can also provide other functionality more 
efficiently than end to end protocols. For example, [LVS99] shows 
how router assist can provide fine grained congestion control for
ALC protocols. Router assist protocols can be designed to 
complement all protocol families described above. 

Note that the distinction in protocol families is not necessarily
precise and mutually exclusive. Actual protocols may use a 
combination of the mechanisms belonging to different classes. For 
example, hybrid NACK/ACK based protocols (such as [WBPM98]) are
possible. Other examples are protocols belonging to class 1 through
3 that take advantage of router support.

2. Building Blocks Rationale 

As specified in RFC 2357 [MRBP98], no single reliable multicast
protocol will likely meet the needs of all applications. Therefore, 
the IETF expects to standardize a number of protocols that are 
tailored to application and network specific needs. This document
concentrates on the requirements for "one-to-many bulk-data
transfer", but in the future, additional protocols and building
blocks are expected that will address the needs of other types of
applications, including "many-to-many" applications. Note that
bulk-data transfer does not refer to the timeliness of the data;
rather, it states that there is a large amount of data to be
transferred in a session. The scope and approach taken for the
development of protocols for these additional scenarios will depend
in large part on the success of the "building-block" approach put
forward in this document.

2.1. Building Blocks Advantages 

Building a large piece of software out of smaller modular components 
is a well understood technique of software engineering. Some of the 
advantages that can come from this include: 

o Specification Reuse. Modules can be used in multiple protocols, 
which reduces the amount of development time required. 

o Reduced Complexity. To the extent that each module can be easily 
defined with a simple API, breaking a large protocol into smaller
pieces typically reduces the total complexity of the system. 

o Reduced Verification and Debugging Time. Reduced complexity
results in reduced time to debug the modules. It is also usually
faster to verify a set of smaller modules than a single larger
module.

o Easier Future Upgrades. There is still ongoing research in
reliable multicast, and we expect the state of the art to continue
to evolve. Building protocols with smaller modules allows them to
be more easily upgraded to reflect future research.

o Common Diagnostics. To the extent that multiple protocols share 
common packet headers, packet analyzers and other diagnostic tools 
can be built which work with multiple protocols. 

o Reduces Effort for New Protocols. As new application requirements 
drive the need for new standards, some existing modules may be 
reused in these protocols. 

o Parallelism of Development. If the APIs are defined clearly, the 
development of each module can proceed in parallel. 

2.2. Building Block Risks 

Like most software specifications, this technique of breaking down a
protocol into smaller components also brings tradeoffs. After a
certain point, the disadvantages outweigh the advantages, and it is
not worthwhile to further subdivide a problem. These risks include:

o Delaying Development. Defining the API through which each pair of
modules inter-operates takes time and effort. As the number of modules
increases, the number of APIs can increase at more than a linear
rate. The more tightly coupled and complex a component is, the
more difficult it is to define a simple API, and the less
opportunity there is for reuse. In particular, the problem of how
to build and standardize fine grained building blocks for a
transport protocol is a difficult one, and in some cases requires
fundamental research.

o Increased Complexity. If there are too many modules, the total 
complexity of the system actually increases, due to the 
preponderance of interfaces between modules.

o Reduced Performance. Each extra API adds some level of processing 
overhead. If an API is inserted into the "common case" of packet
processing, this risks degrading total protocol performance. 

o Abandoning Prior Work. The development of robust transport
protocols is a long and time intensive process, which is heavily
dependent on feedback from real deployments. A great deal of work
has been done over the past five years on components of protocols
such as RMTP-II, SRM, and PGM. Attempting to dramatically
re-engineer these components risks losing the benefit of this prior
work.
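
The "Delaying Development" risk above says the number of APIs can grow faster than linearly. A small calculation makes this concrete under one assumption that the memo does not state: in the worst case, every pair of modules needs its own interface. The function name is hypothetical.

```python
def pairwise_interfaces(modules: int) -> int:
    """Worst-case interface count when every pair of modules interacts."""
    return modules * (modules - 1) // 2


# Doubling the number of modules roughly quadruples the interface count.
for n in (2, 4, 8, 16):
    print(n, pairwise_interfaces(n))
```

In practice most module pairs never interact, so real counts fall well below this bound; the sketch only shows why finer subdivision eventually stops paying off.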

2.3. Building Block Requirements

Given these tradeoffs, we propose that a building block must meet the
following requirements: 

o Wide Applicability. In order to have confidence that the
component can be reused, it should apply across multiple protocol
families and allow for the component's evolution.

o Simplicity. In order to have confidence that the specification of 
the component APIs will not dramatically slow down the standards 
process, APIs must be simple and straightforward to define. No
new fundamental research should be done in defining these APIs. 

o Performance. To the extent possible, the building blocks should 
attempt to avoid breaking up the "fast track", or common case 
packet processing. 

3. Protocol Components

This section proposes a functional decomposition of RM bulk-data 
protocols from the perspective of the functional components provided 
to an application by a transport protocol. It also covers some 
components that while not necessarily part of the transport protocol, 
are directly impacted by the specific requirements of a reliable
multicast transport. The next section then specifies recommended 
building blocks that can implement these components. 

Although this list tries to cover all the most common
transport-related needs of one-to-many bulk-data transfer applications, new
application requirements may arise during the process of 
standardization, hence this list must not be interpreted as a 
statement of what the transport layer should provide and what it 
should not. Nevertheless, it must be pointed out that some 
functional components have been deliberately omitted since they have 
been deemed irrelevant to the type of application considered (i.e., 
one-to-many bulk-data transfer). Among these are advanced message
ordering (i.e., those which cannot be implemented through a simple 
sequence number) and atomic delivery. 

It is also worth mentioning that some of the functional components 
listed below may be required by other functional components and not 
directly by the application (e.g., membership knowledge is usually 
required to implement ACK-based reliability).

The following list covers various transport functional components and 
splits them into sub-components.

o Data Reliability (ensuring good throughput)
                     |
                     | - Loss Detection/Notification
                     | - Loss Recovery
                     | - Loss Protection

o Congestion Control
                     |
                     | - Congestion Feedback
                     | - Rate Regulation
                     | - Receiver Controls

o Security

o Group membership
                     |
                     | - Membership Notification
                     | - Membership Management

o Session Management
                     |
                     | - Group Membership Tracking
                     | - Session Advertisement
                     | - Session Start/Stop
                     | - Session Configuration/Monitoring

o Tree Configuration



Note that not all components are required by all protocols, depending
upon the fully defined service that is being provided by the
protocol. In particular, some minimal service models do not require
many of these functions, including loss notification, loss recovery,
and group membership.

3.1. Sub-Components Definition 

Loss Detection/Notification. This includes how missing packets are
detected during transmission and how knowledge of these events is
propagated to one or more agents which are designated to recover from
the transmission error. This task raises major scalability issues
and can lead to feedback implosion and poor throughput if not
properly handled. Mechanisms based on TRACKs (tree-based positive
acknowledgements) or NACKs (negative acknowledgements) are the most
widely used to perform this function. Mechanisms based on a
combination of TRACKs and NACKs are also possible.
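
As an illustration of NACK-style loss detection, the sketch below finds gaps in the sequence numbers a receiver has seen. It is a simplification under stated assumptions: sequence numbers are consecutive integers starting at `first_seq`, they do not wrap, and the function name is hypothetical rather than part of any protocol named here.

```python
def missing_sequence_numbers(received, first_seq=0):
    """Return the sequence numbers a receiver would request in a NACK.

    A gap is any number between first_seq and the highest number seen
    that has not arrived.
    """
    seen = set(received)
    if not seen:
        return []
    return [seq for seq in range(first_seq, max(seen)) if seq not in seen]


print(missing_sequence_numbers([0, 1, 3, 6]))  # [2, 4, 5]
```

Gap detection of this kind cannot notice the loss of the last packets in a transmission, and it says nothing about suppressing duplicate NACKs from many receivers, which is where the feedback implosion mentioned above arises.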

Loss Recovery. This function responds to loss notification events 
through the transmission of additional packets, either in the form of 
copies of those packets lost or in the form of FEC packets. The 
manner in which this function is implemented can significantly affect 
the scalability of a protocol. 

Loss Protection. This function attempts to mask packet-losses so
that they don't become Loss Notification events. This function can
be realized through the pro-active transmission of FEC packets. Each
FEC packet is created from an entire application data unit [LMSSS97]
or a portion of an application data unit [RV97], [BKKKLZ95], a fact
that allows a receiver to recover from some packet loss without
further retransmissions. The number of losses that can be recovered 
from without requiring retransmission depends on the amount of FEC 
packets sent in the first place. Loss protection can also be pushed 
to the extreme when good throughput is achieved without any Loss 
Detection/Notification and Loss Recovery functionality, as in the ALC 
family of protocols defined above. 
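
The idea of recovering a loss from pro-actively sent FEC packets can be shown with the simplest possible code, a single XOR parity packet over a block of equal-length packets. This is a stand-in chosen for brevity, not the coding scheme of the cited works, which use considerably stronger erasure codes; the function names are hypothetical.

```python
def xor_parity(packets):
    """Build one parity packet as the byte-wise XOR of equal-length packets."""
    parity = bytearray(len(packets[0]))
    for pkt in packets:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single missing packet from the survivors and the parity."""
    return xor_parity(list(received) + [parity])


block = [b"abcd", b"efgh", b"ijkl"]
parity = xor_parity(block)

# Suppose the second packet is lost in transit: no retransmission is needed.
restored = recover_missing([block[0], block[2]], parity)
print(restored == block[1])
```

One parity packet repairs at most one loss per block, which mirrors the statement above that the number of recoverable losses depends on how many FEC packets were sent.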

Congestion Feedback. For sender driven congestion control protocols, 
the receiver must provide some type of feedback on congestion to the 
sender. This typically involves loss rate and round trip time 
measurements.

Rate Regulation. Given the congestion feedback, the sender then must 
adjust its rate in a way that is fair to the network. One proposal 
that defines this notion of fairness and other congestion control 
requirements is [Whetten99].
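
The memo does not prescribe a rate regulation rule. As one hedged illustration of turning congestion feedback (loss rate and round trip time) into a sending rate, the sketch below uses the well-known square-root approximation of TCP throughput, rate = (MSS / RTT) * sqrt(3/2) / sqrt(p). Treating this as the fairness target is an assumption of the example, and the function and parameter names are hypothetical.

```python
import math


def tcp_friendly_rate(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate a TCP-comparable sending rate in bytes per second."""
    if not 0.0 < loss_rate <= 1.0:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_s <= 0.0:
        raise ValueError("rtt_s must be positive")
    # (MSS / RTT) * sqrt(3/2) / sqrt(p), written with a single square root.
    return (mss_bytes / rtt_s) * math.sqrt(1.5 / loss_rate)


# Feedback of 1.5% loss and 100 ms RTT with 1000-byte packets gives
# roughly 100 kB/s; quadrupling the loss rate halves the rate.
print(tcp_friendly_rate(1000, 0.1, 0.015))
print(tcp_friendly_rate(1000, 0.1, 0.06))
```

A sender would recompute this whenever new feedback arrives and pace its transmissions accordingly; smoothing the inputs to avoid oscillation is left out of the sketch.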

Receiver Controls. In order to avoid allowing a receiver that has an 
extremely slow connection to the sender to stop all progress within 
single rate schemes, a congestion control algorithm will often 
require receivers to leave groups. For multiple rate approaches, 
receivers of all connection speeds can have data delivered to them 
according to the rate of their connection without slowing down other 
receivers.

Security. Security for reliable multicast contains a number of
complex and tricky issues that stem in large part from the IP
multicast service model. In this service model, hosts do not send 
traffic to another host, but instead elect to receive traffic from a 
multicast group. This means that any host may join a group and 
receive its traffic. Conversely, hosts may also leave a group at any 
time. Therefore, the protocol must address how it impacts the 
following security issues: 

o Sender Authentication (since any host can send to a group),

o Data Encryption (since any host can join a group) 

o Transport Protection (denial of service attacks, through 
corruption of transport state, or requests for unauthorized 
resources)

o Group Key Management (since hosts may join and leave a group at
any time) [WHA98] 

In particular, a transport protocol needs to pay particular attention 
to how it protects itself from denial of service attacks, through 
mechanisms such as lightweight authentication of control packets 
[HW99].

With the Source Specific Multicast (SSM) service model, a host joins
specifically to a sender and group pair. Thus, SSM offers more 
security against hosts receiving traffic from a denial of service 
attack where an arbitrary sender sends packets that hosts did not 
specifically request to receive. Nevertheless, it is recommended 
that additional protections against such attacks should be provided 
when using SSM, because the protection offered by SSM against such 
attacks may not be enough. 

Sender Authentication, Data Encryption, and Group Key Management. 
While these functions are not typically part of the transport layer 
per se, a protocol needs to understand what ramifications it has on 
data security, and may need to have special interfaces to the 
security layer in order to accommodate these ramifications. 

Transport Protection. The primary security task for a transport 
layer is that of protecting the transport layer itself from attack. 
The most important function for this is typically lightweight 
authentication of control packets in order to prevent corruption of 
state and other denial of service attacks. 
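
A minimal sketch of lightweight control-packet authentication follows, using a truncated HMAC appended to each control packet. It assumes a symmetric key already shared by the group (key distribution is the Group Key Management problem and is not shown), and it is not the mechanism of [HW99], whose details are not given here. The function names and the tag length are illustrative choices.

```python
import hashlib
import hmac

TAG_LEN = 16  # truncated tag length in bytes; an illustrative choice


def seal(key: bytes, control_packet: bytes) -> bytes:
    """Append a truncated HMAC-SHA256 tag to a control packet."""
    tag = hmac.new(key, control_packet, hashlib.sha256).digest()[:TAG_LEN]
    return control_packet + tag


def open_sealed(key: bytes, data: bytes):
    """Return the packet body if the tag verifies, otherwise None."""
    if len(data) < TAG_LEN:
        return None
    body, tag = data[:-TAG_LEN], data[-TAG_LEN:]
    expected = hmac.new(key, body, hashlib.sha256).digest()[:TAG_LEN]
    # Constant-time comparison avoids leaking how many bytes matched.
    return body if hmac.compare_digest(tag, expected) else None


key = b"k" * 32
sealed = seal(key, b"NACK 42")
print(open_sealed(key, sealed))                       # accepted
print(open_sealed(key, b"NACK 43" + sealed[-TAG_LEN:]))  # forged, rejected
```

A shared-key tag only shows that a packet came from some holder of the group key, so it protects transport state against outsiders but does not authenticate an individual sender.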

Membership Notification. This is the function through which the data 
source--or upper level agent in a possible hierarchical
organization--learns about the identity and/or number of receivers or
lower level agents. To be scalable, this typically will not provide 
total knowledge of the identity of each receiver. 

Membership Management. This implements the mechanisms for members to 
join and leave the group, to accept/refuse new members, or to
terminate the membership of existing members. 

Group Membership Tracking. As an optional feature, a protocol may 
interface with a component which tracks the identity of each receiver 
in a large group. If so, this feature will typically be implemented 
out of band, and may be implemented by an upper level protocol. This 
may be useful for services that require tracking of usage of the 
system, billing, and usage reports.

Session Advertisement. This publishes the session name/contents and
the parameters needed for its reception. This function is usually 
performed by an upper layer protocol (e.g., [HPW99] and [HJ98]).

Session Start/Stop. These functions determine the start/stop time of 
sender and/or receivers. In many cases this is implicit or performed 
by an upper level application or protocol. In some protocols, 
however, this is a task best performed by the transport layer due to 
scalability requirements. 

Session Configuration/Monitoring. Due to the potentially far 
reaching scope of a multicast session, it is particularly important 
for a protocol to include tools for configuring and monitoring the 
protocol's operation. 

Tree Configuration. For protocols which include hierarchical 
elements (such as PGM and RMTP-II), it is important to configure
these elements in a way that has approximate congruence with the 
multicast routing topology. While tree configuration could be 
included as part of the session configuration tools, it is clearly 
better if this configuration can be made automatic. 

4. Building Block Recommendations 

The families of protocols introduced in section 1.1 generally use
different mechanisms to implement the protocol functional components 
described in section 3. This section tries to group these mechanisms 
in macro components that define protocol building blocks. 

A building block is defined as 

"a logical protocol component that results in explicit APIs for use 
by other building blocks or by the protocol client." 

Building blocks are generally specified in terms of the set of 
algorithms and packet formats that implement protocol functional 
components. A building block may also have APIs through which it
communicates to applications and/or other building blocks. Most
building blocks should also have a management API, through which they
communicate to SNMP and/or other management protocols.

In the following section we will list a number of building blocks
which, at this stage, seem to cover most of the functional components 
needed to implement the protocol families presented in section 1.1. 
Nevertheless this list represents the "best current guess", and as 
such it is not meant to be exhaustive. The actual building block 
decomposition, i.e., the division of functional components into 
building blocks, may also have to be revised in the future.

4.1. NACK-based Reliability

This building block defines NACK-based loss detection/notification
and recovery. The major issues it addresses are implosion prevention
(suppression) and NACK semantics (i.e., how packets to be
retransmitted should be specified, both in the case of selective and
FEC loss repair). Suppression mechanisms to be considered are:

o Multicast NACKs 

o Unicast NACKs and Multicast confirmation 

These suppression mechanisms primarily need to minimize delay
while also minimizing redundant messages. They may also need to have
special weighting to work with Congestion Feedback.
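
The trade-off between delay and redundant messages can be illustrated
with a small simulation of the first mechanism (multicast NACKs). This
is only a sketch under stated assumptions: the function name, the
uniform random backoff, and the single fixed propagation delay are
hypothetical choices made for the example and are not taken from any
RMT specification.

```python
import random


def simulate_nack_suppression(num_receivers, backoff_window, propagation_delay, seed=0):
    """Count how many NACKs are multicast for one lost packet.

    Assumed model: every receiver detects the loss at time 0, picks a
    uniform random timer in [0, backoff_window), and sends its NACK when
    the timer fires unless a NACK from another receiver has reached it.
    """
    rng = random.Random(seed)
    timers = sorted(rng.uniform(0.0, backoff_window) for _ in range(num_receivers))
    first = timers[0]
    # The earliest NACK reaches everyone at first + propagation_delay;
    # only receivers whose timers fire no later than that still send.
    return sum(1 for t in timers if t <= first + propagation_delay)


if __name__ == "__main__":
    for window in (0.1, 1.0, 10.0):
        sent = simulate_nack_suppression(100, window, propagation_delay=0.05)
        print(f"backoff window {window}: {sent} NACKs sent")
```

In this model a wider backoff window tends to suppress more NACKs, but
it also postpones the first repair request, which is the tension
described above.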

4.2. FEC coding 

This building block is concerned with packet level FEC information 
when FEC codes are used either proactively or as repairs in reaction 
to lost packets. It specifies the FEC codec selection and the FEC 
packet naming (indexing) for both reactive FEC repair and pro-active 
FEC. 
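
As a minimal illustration of packet level FEC, the sketch below uses
the simplest possible code, a single XOR parity packet per block, and
identifies packets by their index within the block. The function names
and the dictionary-based indexing are assumptions made for this
example; practical codecs are typically stronger erasure codes, and
this sketch is not the codec of any particular building block.

```python
def xor_parity(packets):
    """Return one repair packet: the byte-wise XOR of equal-length packets."""
    length = len(packets[0])
    if any(len(p) != length for p in packets):
        raise ValueError("all source packets must have the same length")
    parity = bytearray(length)
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)


def recover_missing(received, parity):
    """Rebuild the single missing source packet of a block.

    `received` maps packet index -> payload for the packets that arrived;
    exactly one source packet of the block is assumed to be missing.
    """
    return xor_parity(list(received.values()) + [parity])


if __name__ == "__main__":
    block = [b"abcd", b"efgh", b"ijkl"]
    repair = xor_parity(block)
    arrived = {0: block[0], 2: block[2]}   # packet 1 was lost
    print(recover_missing(arrived, repair))
```

The same parity packet can be sent proactively with the block or
reactively after a loss report; only the indexing has to be agreed on
by sender and receivers.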

4.3. Congestion Control 

There will likely be multiple versions of this building block, 
corresponding to different design policies in addressing congestion 
control. Two main approaches are considered for the time being: a 
source-based rate regulation with a single rate provided to all the 
receivers in the session, and a multiple rate receiver-driven 
approach with different receivers receiving at different rates in the 
same session. The multiple rate approach may use multiple layers of 
multicast traffic [VRC98] or router filtering of a single layer 
[LVS99]. The multiple rate approach is most applicable for ALC
protocols.

Both approaches are still under study; however, the first
seems to be mature enough [HFW99] to allow the standardization
process to begin.

At the time of writing this document, a third class of congestion
control algorithm based on router support is beginning to emerge in
the IRTF RMRG [LVS99]. This work may lead to the future
standardization of one or more additional building blocks for
congestion control.

4.4. Generic Router Support

The task of designing RM protocols can be made much easier by the 
presence of some specific support in routers. In some application- 
specific cases, the increased benefits afforded by the addition of 
special router support can justify the resulting additional 
complexity and expense [FLST98].

Functional components which can take advantage of router support 
include feedback aggregation/suppression (both for loss notification
and congestion control) and constrained retransmission of repair 
packets. Another component that can take advantage of router support 
is intentional packet filtering to provide different rates of 
delivery of packets to different receivers from the same multicast 
packet stream. This could be most advantageous when combined with 
ALC protocols [LVS99].

The process of designing and deploying these mechanisms inside 
routers can be much slower than the one required for end-host 
protocol mechanisms. Therefore, it would be highly advantageous to
define these mechanisms in a generic way, so that multiple protocols can
use them when available without necessarily depending on them.

This component has two halves, a signaling protocol and actual router 
algorithms. The signaling protocol allows the transport protocol to 
request from the router the functions that it wishes to perform, and 
the router algorithms actually perform these functions. It is more 
urgent to define the signaling protocol, since it will likely impact 
the common case protocol headers. 

An important component of the signaling protocol is some level of 
commonality between the packet headers of multiple protocols, which 
allows the router to recognize and interpret the headers. 

4.5. Tree Configuration 

It has been shown that the scalability of RM protocols can be greatly 
enhanced by the insertion of some kind of retransmission or feedback 
aggregation agents between the source and receivers. These agents 
are then used to form a tree with the source at (or near) the root, 
the receivers at the leaves of the tree, and the aggregation/local 
repair nodes in the middle. The internal nodes can either be 
dedicated software for this task, or they may be receivers that are 
performing dual duty. 

The effectiveness of these agents to assist in the delivery of data 
is highly dependent upon how well the logical tree they use to 
communicate matches the underlying routing topology. The purpose of
this building block would be to construct and manage the logical tree
connecting the agents. Ideally, this building block would perform 
these functions in a manner that adapts to changes in session 
membership, routing topology, and network availability. 

4.6. Data Security 

At the time of writing, the security issues are the subject of 
research within the IRTF Secure Multicast Group (SMuG). Solutions
for these requirements will be standardized within the IETF when 
ready. 

4.7. Common Headers 

As pointed out in the generic router support section, it is important 
to have some level of commonality across packet headers. It may also 
be useful to have common data header formats for other reasons. This 
building block would consist of recommendations on the packet header
fields that protocols should make common among themselves.

4.8. Protocol Cores 

The above building blocks consist of the functional components listed 
in section 3 that appear to meet the requirements for being 
implemented as building blocks presented in section 2.

The other functions from section 3, which are not covered above, 
should be implemented as part of "protocol cores", specific to each 
protocol standardized. 

5. Security Considerations 

RFC 2357 specifically states that "reliable multicast Internet-Drafts 
reviewed by the Transport Area Directors must explicitly explore the 
security aspects of the proposed design." Specifically, RMT building 
block works in progress must examine the denial-of-service attacks
that can be made upon building blocks and affected by building blocks 
upon the Internet at large. This requirement is in addition to any 
discussions regarding data-security, that is the manipulation of or 
exposure of session information to unauthorized receivers. Readers 
are referred to section 5.e of RFC 2357 for further details. 

6. IANA Considerations

There will be more than one building block, and possibly multiple
versions of individual building blocks as their designs are refined.
For this reason, the creation of new building blocks and new building
block versions will be administered via a building block registry
maintained by IANA. Initially, this registry will be
empty, since the building blocks described in sections 4.1 to 4.3 are
presented for example and design purposes. The requested IANA
building block registry will be populated from specifications as they
are approved for RFC publication (using the "Specification Required"
policy as described in RFC 2434 [RFC2434]). A registration will
consist of a building block name, a version number, a brief text
description, a specification RFC number, and a responsible person, to
which IANA will assign the type number.



7. Conclusions 



In this document, we briefly described a number of building blocks 
that may be used for the generation of reliable multicast protocols 
to be used in the application space of one-to-many reliable bulk-data 
transfer. The list of building blocks presented was derived from 
considering the functions that a protocol in this space must perform 
and how these functions should be grouped. This list is not intended 
to be all-inclusive but instead to act as a guide as to which building
blocks are considered during the standardization process within the 
Reliable Multicast Transport WG.

Abstract: In this paper, we propose an application level
multicast architecture for multimedia data communications
called MHPF (Multicast Heterogeneous Packet Flows),
which provides the following services:
which provides the following services: 

• MHPF provides an overlay network that enables applica- 
tion level multicast over the Internet. 

• MHPF efficiently supports heterogeneity in the multicast 
group by ensuring that each receiver receives the highest pri- 
ority portion of the multimedia data stream that its connec- 
tion quality can sustain. 

• In order to support interactive applications, MHPF provides
for loose synchronization among the receivers, where
each receiver perceives approximately the same progress of 
the multicast session. 

• At the same time, MHPF guarantees the delivery of the 
packets in the multimedia data stream that are marked as 
reliable by the application. 

Unlike most contemporary approaches, MHPF achieves 
these features without decomposing the heterogeneous mul- 
timedia stream into component layers; instead it handles the 
multimedia stream as a single heterogeneous data stream 
with the aid of HPF transport protocol at the end hosts and 
the special functionality at the designated MHPF multicast 
servers. We present a set of simulation-based performance 
results to illustrate that MHPF achieves its design goals. 



I. Introduction 

In recent years, interactive and real-time multimedia
applications such as video teleconferencing, multimedia
streaming, and distributed shared whiteboards have become
increasingly popular in the Internet. The emergence
of such applications has created two new phenomena in the 
Internet: (a) multicast support is becoming critical, since 
such applications typically have multiple participants, and 
(b) the data streams are becoming heterogeneous in na- 
ture (e.g. a typical video conference session consists of 
text data, multi-resolution audio and video streams with 
different reliability and priority requirements). Our target 
set of applications thus requires sophisticated multimedia 
support for small to medium multicast groups.

Supporting multicast communications of interactive 
multimedia applications thus introduces unique challenges 
hitherto unaddressed in traditional network architectures 
and transport protocols: 

• Multicast groups consist of heterogeneous receivers, i.e. 
different receivers in the same multicast group may have
different processing capability and may perceive different
connection qualities.

• Multimedia data streams are often multi-resolution, con- 
sisting of interleaved sub-streams with different priority 
requirements, where lower priority streams progressively 
improve the perceived data quality at the receiver, e.g. 
I/P/B frames of an MPEG stream. 

Taking the environment and application requirements 
into consideration, we have identified three goals that our 
architecture for multicast heterogeneous streams must sat- 
isfy: 

1. Different receivers of a multicast group may perceive 
different connection qualities, and each receiver must re- 
ceive the highest priority portion of the heterogeneous data 
stream that its connection quality can sustain. 

2. The receivers in the same group must be loosely syn- 
chronized, i.e. they must perceive approximately the same 
progress of the multimedia session. In particular, it is 
preferable for a poorly connected receiver to lose more 
frames but be synchronized with other receivers rather 
than receive all the frames and lag behind. 

3. In addition, all the packets that are marked reliable in 
the heterogeneous packet stream by the application need 
to be delivered reliably to all the receivers. 

These three characteristics serve as our metrics for eval- 
uating existing solutions, and as guidelines for the design 
of our multicast architecture. 

The standard IP multicast protocol only provides best-effort
delivery of multicast data. Recent transport layer
proposals designed to address at least some of the issues
listed above include RTP [1], DSG [2], RLM [3] (for
real-time/streaming data delivery), and a variety of
(semi-)reliable multicast transport protocols [4], [5] (for
error-resilient delivery).

In related work, a layered multicast approach is becom- 
ing very popular for multicasting prioritized data streams 
in the Internet [3], [6], [7], [8], [9], [10]. Essentially, the 
approach is to decompose the multi-resolution data stream 
(e.g. an MPEG stream) into component single resolution 
streams (e.g. I frame, P frame, and B frame streams), 
establish a distinct multicast group for each component 
stream using standard IP multicast, and let each receiver 
independently decide which multicast groups it wants to 
join based on the perceived connection quality. This approach
has the charm of being intuitively simple, and of
not requiring special mechanisms other than standard IP
multicast functionality from the network. However, it 
also has several inherent limitations imposed by layering 
such as the coarse granularity of adaptation to network dy- 
namics, potentially slow reaction to changes in connection 
quality, and destructive interference by concurrent adapta- 
tion by multiple receivers [8], [9], [10]. We revisit these 
issues in the next section. 

In this paper, we present the MHPF (Multicast Heterogeneous
Packet Flows) architecture that supports multicast
communication of heterogeneous data streams for
small to medium multicast groups without decomposing
them into component homogeneous data streams. MHPF
exhibits the three elements of the 'desired' behavior of the
multimedia multicast communications mentioned above
without incurring the drawbacks of the layered approach
but at the expense of introducing additional functionality
in specialized "MHPF servers". MHPF is composed
of two key components: (a) an adaptive transport protocol
called HPF (Heterogeneous Packet Flows) [12], which
provides end-to-end transport of heterogeneous packet
flows, and (b) a virtual network of collaborating MHPF
servers, which provides application level multicast support
mechanisms for heterogeneous packet flows. The focus of
this paper is the latter component.

In summary, related work has typically taken a minimal- 
ist approach, and tried to answer the question: how can we 
best support multimedia multicast communications with- 
out adding any special mechanisms in the network? We 
approach the problem from the opposite perspective: what 
are the minimum mechanisms that need to be implemented 
by specialized multicast servers in the overlay network in 
order to support the desired service for multicast multi- 
media communications? To this end, our first goal is to 
achieve the desired service for multimedia multicast iden- 
tified above; subject to this goal, our second goal is to min- 
imize the state management and computation complexity 
of the mechanisms in the multicast servers. Finally, we 
compare the two approaches to see whether we are able to 
achieve justifiable performance improvements at accept- 
able cost. In order to achieve incremental deployability 
over the current Internet, our approach is to enable special- 
ized multicast services in "application level" MHPF-aware 
servers that provide an overlay network. At the outset, we 
recognize that the mechanisms proposed in this paper can 
be instantiated either in the multicast routers themselves 
or in the application level servers. Our choice has been 
motivated by engineering and deployability concerns. 

The remainder of the paper is organized as follows: Section
II presents the related work. Section III presents the
service model and the design choices for the MHPF
architecture. Section IV details the design of the MHPF
architecture focusing on the network level functionality.
Section V presents a simulation-based performance
evaluation. Section VI summarizes the issues and trade-offs.

II. Related Work 

Since multi-resolution streaming is becoming very pop- 
ular in the multimedia coding community, future networks 
must support heterogeneous multicast data streams to en- 
able multi-party multimedia applications. Related work in 
this area can be categorized in the following three ways: 
(a) sender-based single rate multicast approach [14], (b) 
replicated multiple multicast group approach [2], and (c) 
layered multicast approach [3], [6], [7], [8], [9], [10], [11].

The first approach is limited since the rate control by the 
sender often reacts slowly to network dynamics and cannot 
provide optimal performance in the multicast environment 
due to the inter-receiver fairness problem [15], and hence 
violates the first point of our goals. The second approach 
uses multiple replicated video streams with different qual- 
ity multicast to distinct groups and lets each receiver de- 
cide which group to join based on its connection capabil- 
ity. Although this scheme is free from the inter-receiver 
fairness issue, the bandwidth efficiency may be poor when
there are multiple replicated streams on the same path, i.e.
it is inefficient in achieving the first goal of the desired system.
Hence most contemporary work adopts the third approach. 

In layered multicasting [3], the sender decomposes the 
video stream into multiple component video substreams 
(or layers), so that the lowest layer provides the low res- 
olution (and high priority) data and the additional layer 
progressively improves the picture quality, and establishes 
a distinct multicast group for each component stream. 
Each receiver independently decides how many multicast 
groups to join according to its connection quality: when 
a receiver sees congestion it drops the highest layer, and
when it does not see congestion for a while it joins the
next layer. This approach has the charm of being intuitively
simple and efficient in terms of network bandwidth
usage since layers are cumulative, and of not requiring any
special support mechanism other than standard IP multicast
routing functionality from the network.
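
The receiver-driven join/leave rule just described can be sketched as
a small state machine. This is an illustration only: the class name,
the notion of a fixed observation period, and the
`quiet_periods_before_join` threshold are assumptions made for the
example, and the sketch omits the join experiments and shared learning
of the actual scheme in [3].

```python
class LayeredReceiver:
    """Toy receiver-driven subscription over cumulative layers 1..num_layers."""

    def __init__(self, num_layers, quiet_periods_before_join=3):
        self.num_layers = num_layers
        self.subscribed = 1          # the base layer is always kept
        self.quiet = 0               # consecutive periods without congestion
        self.quiet_needed = quiet_periods_before_join

    def on_period(self, congested):
        """Update the subscription after one period; return the layer count."""
        if congested:
            self.quiet = 0
            if self.subscribed > 1:
                self.subscribed -= 1     # drop the highest layer
        else:
            self.quiet += 1
            if self.quiet >= self.quiet_needed and self.subscribed < self.num_layers:
                self.subscribed += 1     # join the next layer
                self.quiet = 0
        return self.subscribed


if __name__ == "__main__":
    r = LayeredReceiver(num_layers=4, quiet_periods_before_join=2)
    observations = [False, False, False, False, True, True, True]
    print([r.on_period(c) for c in observations])
```

Each step changes the received rate by a whole layer, which is the
coarse granularity of adaptation discussed below.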

However, researchers have found a list of problems with
layered multicasting [8], [9], [10], [11]: (a) coarse grain
rate adaptation due to join/leave latency (see footnote 1), (b) bandwidth
fluctuation due to relatively high bandwidth layers, (c) destructive
interference of independent join/leave operations
of each receiver, (d) priority-based delivery is not guaranteed
(especially during the join experiments of its own
or other receivers), (e) process overhead of decomposition
and resynchronization of the video stream at the end hosts,
(f) unnecessary coupling of video encoding and rate adaptation
mechanisms (e.g. the number of layers and their
bandwidths are determined by video encoding scheme),
(g) overhead of shared learning [3] in terms of message
exchange and state management, and (h) congestion information
acquired from shared learning is often incorrect
and misleads the decision of each receiver. Points (a),
(b), (c), and (d) are related to the problem of layering, and
points (e), (f), (g) and (h) are from pushing most functionality
to the application.

Footnote 1: Join operations are relatively fast if the local multicast router has
already joined the group; otherwise, join operations are expensive, taking
up to one round-trip time to the sender's domain or a special domain
called root domain [13]. Leave latencies are also relatively high. In
IGMP version 1, leave operations take a couple of minutes. In version
2, it is configurable using the max response time field in the order of
100 msec.

Some researchers [8], [9], [10] tried to alleviate the
problem incurred by independent join/leave operations via
synchronizing the receiver actions (point c) [8], [9], or
structuring and sharing the rate adaptation history of individual
receivers (points g, h) [10]. Wu et al. [9] proposed using
small equal bandwidth layers called "ThinStreams" in
order to alleviate the bandwidth fluctuation problem (point
b) and to decouple control and video encoding mechanism
(point f). However, we claim that incremental enhancements
cannot solve all the problems of the layered multicast
approach since some of the problems are inherent to the
'layering mechanism' itself. In summary, layering satisfies
the second goal of the desired system, but is inefficient
in achieving the first goal and does not address the third goal.

In related work, we proposed an adaptive transport pro- 
tocol called HPF, which provides end-to-end unicast trans- 
port for heterogeneous packet flows [12]. The basic idea is 
that rather than using discrete layers, the sender interleaves 
packets with different priority and reliability requirements 
in a single transport layer stream and enables the receiver 
to receive as many high priority packets as the connection
quality can sustain. HPF allows an application to specify 
per-frame policies for reliability, priority, and deadline re- 
quirements, and implements the end-to-end framing, flow 
control, error control, and congestion control mechanisms 
for the heterogeneous data stream. 

The MHPF architecture proposed in this paper extends 
the idea of HPF to the multicast domain. Specifically, 
MHPF adopts the HPF transport protocol as an end-to-end 
mechanism for multicasting heterogeneous data streams
and introduces special mechanisms, such as priority-based 
filtering, in the MHPF servers to support efficient trans- 
mission of heterogeneous data streams in the network. As 
a result, MHPF can adapt its sending rate at a much finer 
grain to the network dynamics than the layering mecha- 
nisms can. 

By the design choice of the MHPF architecture, we do 
not require 'layering' in order to support efficient multicas- 
ting of prioritized multimedia data streams; hence MHPF 
does not suffer from the problems of layered multicast 
approach listed above. However, this comes at the price 
of establishing an overlay network for enabling application
level multicast with specialized functionality to support
multimedia streams. We explore the trade-offs of the
improved service model of MHPF and its computational
complexity later in the paper.

III. MHPF Overview 

Let us now look at the service model of MHPF, and the 
mechanisms needed to achieve the service model. To re- 
capitulate, the target applications of MHPF are multi-party 
multimedia applications that transmit multi-resolution data 
streams with different reliability and priority requirements. 

A. Service Model

The service model for heterogeneous data stream mul- 
ticast needs to address two aspects: reliability semantics, 
and synchronization among receivers. 

In terms of reliability, strict TCP-like semantics of guar- 
anteeing sequencing and reliability for all packets is nei- 
ther necessary nor efficient for typical multicast multime- 
dia traffic. However, the heterogeneous nature of the mul- 
timedia stream requires some portion of the data to be de- 
livered reliably at the receivers, e.g. control packets in an 
MPEG stream. 

Providing semi-reliable semantics can be done in two 
ways: deliver every packet with a certain probability, or 
guarantee reliable delivery for certain designated packets 
and best-effort delivery for the other packets with prefer- 
ential delivery of higher priority packets. We choose the 
latter reliability semantics for MHPF, which we call par- 
tial reliability. 

In order to support interactive multimedia applications, 
multicast transport protocols need to provide loose syn- 
chronization among the receivers in the same multicast 
group. However, slow receivers cannot keep phase with 
faster receivers without losing more packets. With the par- 
tial reliability semantics, our service model mandates the 
preferential dropping of low priority best-effort packets to 
enable MHPF to provide receiver synchronization. Thus, 
the receiver of an MHPF stream will be provided the ab- 
straction of a single sequenced data stream, possibly with 
holes corresponding to the dropped unreliable low priority 
packets. 
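
To make the "sequenced stream with holes" abstraction concrete, the
following sketch shows a bounded queue that, when it overflows, drops
the lowest priority unreliable packet and never drops a reliable one.
The `Packet` fields, the function name, and the tie-breaking rule
(drop the newest among equal priorities) are hypothetical choices for
this illustration, not the MHPF server implementation.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    seq: int          # position in the single sequenced stream
    priority: int     # 0 is the highest priority
    reliable: bool    # reliable packets are never dropped in this sketch


def enqueue_with_priority_drop(queue, packet, capacity):
    """Append `packet`; if over capacity, drop the lowest priority unreliable packet.

    Returns the dropped packet, or None if nothing was dropped.
    """
    queue.append(packet)
    if len(queue) <= capacity:
        return None
    droppable = [p for p in queue if not p.reliable]
    if not droppable:
        return None  # only reliable packets are queued; nothing may be dropped
    # Lowest priority means the largest priority number; ties drop the newest.
    victim = max(droppable, key=lambda p: (p.priority, p.seq))
    queue.remove(victim)
    return victim


if __name__ == "__main__":
    queue = []
    stream = [Packet(0, 0, True), Packet(1, 3, False), Packet(2, 1, False),
              Packet(3, 3, False), Packet(4, 2, False)]
    for pkt in stream:
        enqueue_with_priority_drop(queue, pkt, capacity=3)
    print([p.seq for p in queue])
```

The surviving sequence numbers remain in order, and the gaps mark the
dropped low priority packets that a receiver would perceive as holes.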

In summary, the service model of MHPF provides par- 
tial reliability with preferential delivery of higher priority 
packets if the connection quality cannot sustain the entire 
data stream, adapting at a fine grain to the network dy- 
namics, sequencing of all packets, and loose synchroniza- 
tion among receivers. 

Now the question is: what are the minimum mechanisms
to achieve the MHPF service model? As we mentioned
in Section II, HPF provides the end-to-end mechanisms
for the partial reliability (i.e. interleaving of reliable
and unreliable packets) with flow control, error control,
and congestion control for unicast communications.
However, the MHPF service model cannot be achieved using
only the end-to-end mechanisms of HPF. Specifically,
we want each receiver to receive as much data as its connection
can sustain - this means that the sender must be
informed of the capacity of the fastest path in the multicast
tree and adapt its sending rate to the fastest path capacity.
In addition, for those paths that cannot sustain the full
data rate, the MHPF service model requires the delivery
of the highest priority portion of the heterogeneous stream
that can be sustained on the paths - this means that MHPF
servers in the network must have the ability to filter data
packets based on priority.

In the next section, we identify the minimum
network level mechanisms required at the MHPF servers
to achieve the desired service model.

B. Design Choices for Multicast Multimedia Support 

We now consider the design choices using Figure 1 as
a reference. In the figure, there is one sender S and four
receivers R1 - R4. The available bandwidths on each
link are shown. The multicast data stream has an aggregate
rate of 3 Mbps, with 0.1 Mbps of priority 0 reliable traffic,
0.7 Mbps of priority 1 traffic, 1.0 Mbps of priority 2 traffic,
and 1.2 Mbps of priority 3 traffic (where priority level 0 is
the highest and 3 is the lowest).




[Figure: multicast tree abstraction]

Fig. 1. Example configuration



Ideally, we want the network to provide the following 
services: 

• Each receiver must receive the data stream at the maximum
rate that it can sustain at any time. In Figure 1, R1
and R2 must receive 3 Mbps, R3 must receive 2 Mbps, and
R4 must receive 1 Mbps of the heterogeneous data stream.

• Each receiver must preferentially receive higher prior- 
ity packets over lower priority packets. For example, R4 
must receive 0.1 Mbps of the priority 0 traffic, 0.7 Mbps 
of priority 1 traffic, and 0.2 Mbps of priority 2 traffic. 

• The network should be utilized efficiently. Specifically,
a packet should not be forwarded to a multicast router unless
there is at least one receiver in its subtree that will receive
the packet. For example, the rate on tunnel T2 should
be 2 Mbps even though the tunnel T2 can sustain 3 Mbps,
because none of the receivers in the subtree of M3 can sustain
more than 2 Mbps.

• As a corollary to the previous goal, the sender should 
transmit data at the maximum rate that can be sustained 
among all receivers. In Figure 1, S should transmit at 3
Mbps. Additionally, the transmission rate of the packets 
marked reliable must be upper-bounded by the minimum 
rate among all the receivers. In Figure 1, the peak rate for 
the reliable component of the heterogeneous data stream 
can be 1 Mbps. 
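
The per-receiver figures quoted in these goals follow from filling each
receiver's sustainable rate with traffic in priority order. The sketch
below reproduces that arithmetic; the function name is hypothetical,
and rates are expressed in kbps (an assumption made only to keep the
arithmetic in integers).

```python
def allocate_by_priority(class_rates_kbps, sustainable_kbps):
    """Fill a receiver's sustainable rate with traffic classes in priority order.

    `class_rates_kbps[i]` is the offered rate of priority level i (0 is the
    highest). Returns the rate delivered for each priority level.
    """
    remaining = sustainable_kbps
    delivered = []
    for rate in class_rates_kbps:
        share = min(rate, remaining)
        delivered.append(share)
        remaining -= share
    return delivered


if __name__ == "__main__":
    stream = [100, 700, 1000, 1200]   # priorities 0..3 of the example stream
    for name, rate in (("R1", 3000), ("R3", 2000), ("R4", 1000)):
        print(name, allocate_by_priority(stream, rate))
```

For R4 this yields 100, 700, and 200 kbps of priorities 0, 1, and 2,
matching the 0.1, 0.7, and 0.2 Mbps stated above.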

One simple solution to achieve these goals would be 
to provide end-to-end feedback and priority dropping at 
the multicast routers. Unfortunately, this approach requires
maintaining session state in multicast routers.
To alleviate this problem, we logically replace the multi- 
cast routers in the figure with application level multicast 
servers. 

Further, for the scalability of the architecture, we do 
not want the sender to maintain per-receiver state. Thus, 
feedback from the receivers must be aggregated as it trick- 
les back up the (application level) tree, similar to ACK- 
aggregation in reliable multicast protocols [16]. Specifi- 
cally, each multicast server can notify its parent about the 
maximum and minimum sustainable rates among all re- 
ceivers in its subtree. The sender will then upper-bound the 
aggregate rate by the maximum rate feedback, and the reli- 
able rate by the minimum rate feedback. Priority dropping 
at each multicast server will ensure that if the queue on 
an outgoing link is full and congestion is detected, lower 
priority packets are preferentially dropped, and higher pri- 
ority packets get through. 
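
The feedback aggregation described above can be sketched as a
bottom-up pass over the tree. Because Figure 1 is not reproduced here,
the topology used in the example (M1 with children M2 and M3, R1 and R2
below M2, R3 and R4 below M3) is an assumption; only the receiver rates
and the placement of R3 and R4 below M3 come from the text. The
function and variable names are likewise hypothetical.

```python
def aggregate_feedback(tree, node, receiver_rates):
    """Return (min_rate, max_rate) over all receivers below `node`.

    `tree` maps a server name to the list of its children; any name that is
    not a key of `tree` is treated as a receiver found in `receiver_rates`.
    """
    if node not in tree:
        rate = receiver_rates[node]
        return rate, rate
    reports = [aggregate_feedback(tree, child, receiver_rates) for child in tree[node]]
    return min(lo for lo, _ in reports), max(hi for _, hi in reports)


if __name__ == "__main__":
    # Assumed topology; rates in Mbps as given for R1..R4.
    tree = {"M1": ["M2", "M3"], "M2": ["R1", "R2"], "M3": ["R3", "R4"]}
    rates = {"R1": 3, "R2": 3, "R3": 2, "R4": 1}
    print("M3 reports", aggregate_feedback(tree, "M3", rates))
    print("M1 reports", aggregate_feedback(tree, "M1", rates))
```

Each server forwards only two numbers to its parent, so the sender
keeps no per-receiver state: it bounds the aggregate rate by the
reported maximum and the reliable rate by the reported minimum.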

The above solution almost works, but not quite. First,
consider tunnel T2 in Figure 1. M1 will send 3 Mbps on
T2 in the solution described above. However, M3 will
drop 1 Mbps on T3, and 2 Mbps on T4. In other words,
M1 should have throttled the transmission rate on T2 to 2
Mbps in order to ensure that the network is efficiently used.
Now consider a more serious problem. The available bandwidth
on T2 is 3 Mbps. However, the outgoing link from
M1 for T2 has 4 Mbps. Thus, M1 will pump packets out
on T2 at a rate of 4 Mbps, and packets will be dropped
in X1. Since we cannot assume any special mechanisms
in X1, the dropping policy at X1 is most likely tail-drop,
and lower priority packets are not preferentially dropped.
Thus, R3 and R4 will receive 2 Mbps and 1 Mbps, respectively,
but not necessarily the highest priority data.

Both the problems described above can be solved only
with some kind of rate adaptation in the overlay tree
(similar to the backpressure mechanisms described in [17],
[18]). For example, M1 must estimate the sustainable rate
on T2, and also get feedback about the maximum rate that
M3 can sustain. M1 must then throttle the transmission
rate on T2 to the minimum of these values. It is
straightforward to see that with a combination of
end-to-end feedback, priority dropping, and rate
adaptation at both the multicast servers and the end host,
all the goals of the network level mechanism are achieved.
This clearly imposes a considerable amount of overhead on
the MHPF design compared to the layered multicast
approach, but must be traded off against the much improved
service model.

IV. Design of the MHPF Architecture 

In this section, we first present an overview of the 
MHPF architecture using an example. We then present the 
details of the architecture with the emphasis on the rate 
adaptation mechanism. 

Briefly, MHPF works as follows: For each session, 
MHPF abstracts a multicast tree T composed only of 
MHPF servers and multicast tunnels between them, gen- 
erated by the IP multicast routing. When an application 
sends a frame, it specifies the reliability and priority pa- 
rameters for the frame. The HPF protocol at the sender 
converts the frame into a sequence of one or more packets, 
all with the same parameters, then queues the packets for 
transmission in a single heterogeneous data stream. The 
MHPF servers implement a specialized packet forwarding 
behavior, so that only the highest priority packets that can 
be accommodated on a path downstream are transmitted 
along the path. 

Each receiver periodically generates rate feedback at the 
prescribed time granularity called epoch. The rate feed- 
back contains the number of packets received in the last 
epoch, which gives information on the connection quality 
that each receiver perceives. The feedback from the re- 
ceivers travels upstream along the same multicast tree T 
but in the opposite direction, and gets aggregated at the 
MHPF servers. 

The feedback aggregation is done in such a way that 
when the feedback reaches the sender, it contains the in- 
formation on the available bandwidth on the fastest and the 
slowest paths in the multicast tree. The HPF protocol at the 
sender controls the sending rate of the aggregate heteroge- 
neous stream by the fastest path feedback, and the reliable 
data rate by the slowest path feedback. The MHPF servers 
also use the rate feedback in order to provide network level 
rate adaptation as described in Section III-B.

[Figure 2: an MHPF server feeding a multicast tunnel through a
priority-drop queue and a token bucket; excess packets are dropped]

Fig. 2. MHPF server and tunnel abstraction

Figure 2 presents an abstraction of the MHPF server 
functionality. Basically, it performs rate control on the 
downstream multicast tunnel using a token bucket, with 
priority-drop buffer management policy. In the example, a 



1  Variables at the tunnel source i
2    ρ_i   // sending rate of i in the current epoch
3    ρ_g   // sending rate of parent tunnel source g in the current epoch
4    ρ_N   // new downstream (estimated) bottleneck rate for next epoch
5    sent  // number of packets sent on i in the current epoch
6    recv  // number of packets received by the receiver connected via
7          // the fastest path in the subtree in the current epoch
8  At the start of each epoch
9    The tunnel source of parent MHPF server g informs its sending rate
10   The tunnel source i updates ρ_g
11   After this update, the tunnel source i performs
12     if (ρ_g < ρ_i - α)
13       increase_constant ← 0
14     else
15       increase_constant ← 1
16 At the end of each epoch
17   The tunnel source i receives min_recv, max_recv and ρ_N from the child
18   recv ← max_recv
19   if (recv = sent)  // linear increase
20     ρ_i ← min(ρ_N, ρ_i + increase_constant)
21   else              // multiplicative decrease
22     ρ_i ← min(ρ_N, ρ_i × 0.5)
22 pi 4- min{p Nt pi x 0.5) 



Fig. 4. Pseudo code of the rate adaptation at the MHPF server 

low priority packet in the queue is dropped to accommo- 
date an incoming high priority packet. Figure 3 illustrates 
an example of the interaction of the MHPF servers and the 
end hosts, and the protocol structure of the MHPF archi- 
tecture. 

At this point, the high level architecture of MHPF 
should be fairly clear to the reader. We now present the 
design details of each component of the MHPF architec- 
ture: network level rate adaptation, priority-based packet 
dropping. 

A. Rate Adaptation Mechanism

Rate adaptation occurs in MHPF over discrete periods
of time called epochs on each downstream multicast
tunnel. For simplicity, we assume that the rate adaptation is
performed for each multicast session in this section. Later 
we relax this condition and describe how rate adaptation is 
performed for aggregate multicast sessions on the tunnel. 

When the sender transmits a packet, it inserts the cur- 
rent epoch id in the data packet. Each receiver maintains a 
count of the number of packets it received corresponding 
to each epoch. When the receiver receives a data packet 
with a new epoch id, it generates a rate feedback containing
max_recv, min_recv, and epoch, where both max_recv and
min_recv contain the number of received packets during
the last epoch. The rate feedback is propagated back to the 
sender along the reverse direction of the multicast tree T. 
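
The receiver-side bookkeeping described above can be sketched roughly as follows. This is only an illustration: the class name Receiver, the method on_packet, and the dictionary layout of the feedback are assumptions made here, and the sketch further assumes that packets arrive in epoch order.

```python
class Receiver:
    """Counts packets per epoch and emits feedback when a new epoch starts."""

    def __init__(self):
        self.epoch = None   # epoch id currently being counted
        self.count = 0      # packets received in that epoch

    def on_packet(self, epoch_id):
        """Process one data packet; return feedback for the finished epoch or None."""
        feedback = None
        if epoch_id != self.epoch:
            if self.epoch is not None:
                # A receiver is a leaf, so min_recv and max_recv carry the same count.
                feedback = {"max_recv": self.count,
                            "min_recv": self.count,
                            "epoch": self.epoch}
            self.epoch, self.count = epoch_id, 0
        self.count += 1
        return feedback


receiver = Receiver()
for epoch_id in [1, 1, 1, 2, 2, 3]:
    report = receiver.on_packet(epoch_id)
    if report is not None:
        print(report)
```

The first packet of epoch 2 triggers the report for epoch 1 (3 packets), and the first packet of epoch 3 triggers the report for epoch 2 (2 packets).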

When an MHPF server receives rate feedback on one of
its downstream tunnels, it performs the rate control on the
tunnel, according to the rate adaptation algorithm shown
in Figure 4 (see also Figure 5). Essentially the rate
adaptation algorithm is the popular congestion control paradigm
of linear increase multiplicative decrease (LIMD). The
basic idea is simple: if there was no packet loss in the last
epoch (on the fastest path in the subtree rooted at the
tunnel source i), i.e. sent = recv, then increase the sending
rate of the tunnel to probe for more available bandwidth in
the subtree; otherwise, detect congestion and decrease the
sending rate.

Fig. 3. A simple example of MHPF operation and the protocol structure of the MHPF architecture: Initially, the sender sends at
the rate of 5 packets with high/medium/low/medium/low priorities. Let us assume only high priority packets are reliable. Since
the M2 → M3 link cannot sustain the full data rate, M2 drops a low priority packet and R1 receives only 4 packets. Likewise, R2
only receives 3 packets (stage I, solid lines). The rate feedback from each receiver contains the number of received packets and
is aggregated at the MHPF servers. When the sender receives the feedback, it contains the numbers of received packets by the
fastest (max = 4) and the slowest receiver (min = 3) in the group (stage II, dashed lines). Using this information, the sender
adjusts the aggregate data rate, which becomes 4 in the next round, as well as the reliable data rate, which is unchanged (stage
III, dotted lines).

Fig. 5. Information exchange for rate adaptation

The algorithm presented in Figure 4 has additional con- 
ditions so that the rate control algorithm enables the tun- 
nel source to quickly synchronize its sending rate with the 
downstream bottleneck sending rate (footnote 2) by taking the
minimum of the calculated sending rate and the estimated
downstream bottleneck rate (Figure 4, lines 19 - 22). 

In addition, the introduction of increase_constant ensures
that the sending rate of a tunnel source i does not increase 

2 The bottleneck sending rate on a path p is the sending rate of a tunnel
source k whose immediate downstream multicast tunnel is the
bottleneck on the path p. The downstream bottleneck sending rate at a tunnel
source i is the bottleneck sending rate on the fastest path in the subtree
of the tunnel source i.



without bound when the bottleneck tunnel is located
upstream (footnote 3) (Figure 4, lines 12 - 15, line 20). The constant
$\alpha \ge 0$ controls the amount of bandwidth that the tunnel
source i can increase over the estimated sending rate of
the upstream bottleneck, and is usually set to 0. The
sending rate is controlled based on max_recv because the
tunnel source needs to be able to serve the receivers
connected via the fastest path in its subtree (Figure
4, line 18).
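
As a rough illustration, the per-epoch update of Figure 4 can be written as a small Python sketch. The class name TunnelSource and the method names are assumptions made here; rates are expressed in packets per epoch, and the variable names follow the figure.

```python
from dataclasses import dataclass


@dataclass
class TunnelSource:
    """Per-tunnel state, named after the variables in Figure 4."""
    rho_i: float                 # current sending rate (packets per epoch)
    alpha: float = 0.0           # allowed headroom over the parent's rate
    increase_constant: int = 1   # 0 freezes the linear increase

    def start_of_epoch(self, rho_g: float) -> None:
        # Lines 12-15: stop probing when the parent already sends more slowly.
        self.increase_constant = 0 if rho_g < self.rho_i - self.alpha else 1

    def end_of_epoch(self, sent: int, max_recv: int, rho_n: float) -> float:
        recv = max_recv                      # line 18: follow the fastest path
        if recv == sent:                     # lines 19-20: linear increase
            candidate = self.rho_i + self.increase_constant
        else:                                # lines 21-22: multiplicative decrease
            candidate = self.rho_i * 0.5
        self.rho_i = min(rho_n, candidate)   # cap at the downstream bottleneck rate
        return self.rho_i


src = TunnelSource(rho_i=10.0)
src.start_of_epoch(rho_g=20.0)
print(src.end_of_epoch(sent=10, max_recv=10, rho_n=50.0))  # no loss: 11.0
print(src.end_of_epoch(sent=11, max_recv=9, rho_n=50.0))   # loss: 5.5
```

When the parent rate falls below rho_i - alpha, increase_constant becomes 0 and a loss-free epoch leaves the rate unchanged, which is how the sketch keeps a tunnel from probing past an upstream bottleneck.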

The state information maintained at an MHPF server M
is as follows: parent and children in the multicast tree;
the current sending rate of the parent ($\rho_g$); for each
tunnel source i: the minimum (min_recv) and the maximum
(max_recv) number of packets received by the receivers in
its subtree in the last epoch, the number of packets sent
on the tunnel in the last epoch (sent), current sending rate
on the tunnel ($\rho_i$), and current sending rate of the child
($\rho_N$). The sending rate of the parent tunnel source $\rho_g$ is
reported by the parent MHPF server when it updates the
sending rate of the tunnel. The MHPF server learns the
other information from the feedback from its children.

For the rate feedback aggregation, the MHPF server
calculates the minimum of all the min_recv values and the
maximum of all the max_recv values from its children and

3 In other words, the bottleneck sending rate on the path from the
sender to the tunnel source i is smaller than the downstream bottleneck
sending rate of i.



forwards this information upstream, along with the
maximum of the sending rates on the downstream tunnels, i.e.
$\rho_N = \max\{\rho_i\}$, $\forall i \in$ downstream tunnel sources. After
the aggregation, the min_recv and max_recv values contain
the number of received packets (during the last epoch) by
the receivers connected via the slowest and the fastest path
in the subtree (rooted at the MHPF server i), respectively.
This provides the information for the rate adaptation
procedure performed by the sender and the upstream MHPF
servers.
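
A minimal sketch of this aggregation step is shown below. The Feedback tuple and the aggregate function are names assumed here for illustration; the receiver counts follow the Figure 3 example (4 and 3 packets), while the two tunnel rates are made-up values.

```python
from typing import Iterable, NamedTuple, Tuple


class Feedback(NamedTuple):
    min_recv: int   # packets received via the slowest path of the subtree
    max_recv: int   # packets received via the fastest path of the subtree


def aggregate(children: Iterable[Feedback],
              tunnel_rates: Iterable[float]) -> Tuple[Feedback, float]:
    """Fold the children's feedback into one report for the parent.

    tunnel_rates holds this server's current sending rate on each
    downstream tunnel; the largest one is forwarded upstream as rho_N.
    """
    reports = list(children)
    merged = Feedback(min_recv=min(r.min_recv for r in reports),
                      max_recv=max(r.max_recv for r in reports))
    return merged, max(tunnel_rates)


# Two receivers that got 4 and 3 packets, behind tunnels sending at 4 and 3.
merged, rho_n = aggregate([Feedback(4, 4), Feedback(3, 3)], [4.0, 3.0])
print(merged, rho_n)
```

Because each server forwards a single merged report, the amount of feedback reaching the sender does not grow with the number of receivers.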



B. Partial Reliability Mechanism 

In order to achieve the partial reliability semantics (with
receiver synchronization), the sender needs to control the
sending rate of the aggregate data stream ($\rho_A$) according
to the fastest path rate, and the reliable data rate ($\rho_R$)
according to the slowest path rate in the multicast tree. The
rationale for the former is to exploit the bandwidth on the
fastest path. The rationale for the latter is to avoid reliable
packet drops, which trigger packet retransmission and
hence degrade the performance. In addition, we want to
preserve the sequencing of interleaved reliable and
unreliable packets. In this section, we describe the end-to-end
rate adaptation performed by the sender, which achieves
the desired partial reliability semantics.

As a result of the feedback aggregation, the rate feed- 
back received by the sender contains the numbers of the 
packets received by the receivers connected via the slow- 
est path and the fastest path in the multicast group, and the 
current sending rate of the immediate downstream MHPF 
server. Note that the feedback is generated periodically 
and the sender will see only one feedback per epoch, irre- 
spective of the multicast group size. 

When the sender receives the feedback, it performs
LIMD rate control for both $\rho_A$ and $\rho_R$ according to
max_recv and min_recv values, respectively. Once the
new rates have been calculated, the sender controls the
sending rate of the heterogeneous data stream using a dual
token bucket mechanism described as follows.

Briefly, the sender maintains two token buckets: one
for R-tokens generated at rate $\rho_R$, and the other for
A-tokens generated at $\rho_A$. When the sender transmits a
reliable packet it needs to acquire both an R-token and an
A-token. On the other hand, when it transmits an unreliable
packet, it only requires an A-token. It is straightforward that
this dual token bucket mechanism upper-bounds the aggregate
data rate by $\rho_A$, and the reliable data rate by $\rho_R$. In
addition, it preserves the sequence of the interleaved reliable
and unreliable packets. For example, even if there are
A-tokens in the bucket, if the preceding reliable packet is
waiting for an R-token, the unreliable packet behind the
reliable packet must wait also.
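
The dual token bucket can be sketched as a small discrete-time simulation. Everything named here (DualTokenBucket, enqueue, tick) is hypothetical, one token is assumed to pay for one packet, and the sketch places no limit on bucket depth, which a practical shaper would presumably add.

```python
from collections import deque


class DualTokenBucket:
    """A-tokens gate every packet; R-tokens additionally gate reliable ones."""

    def __init__(self, rho_a, rho_r):
        self.rho_a, self.rho_r = rho_a, rho_r   # token rates (tokens per second)
        self.a_tokens = 0.0
        self.r_tokens = 0.0
        self.queue = deque()                     # FIFO of (name, reliable) pairs

    def enqueue(self, name, reliable):
        self.queue.append((name, reliable))

    def tick(self, dt=1.0):
        """Add dt seconds of tokens, then send packets from the head of the queue."""
        self.a_tokens += self.rho_a * dt
        self.r_tokens += self.rho_r * dt
        sent = []
        while self.queue:
            name, reliable = self.queue[0]
            if self.a_tokens < 1 or (reliable and self.r_tokens < 1):
                break   # the head packet waits, so everything behind it waits too
            self.a_tokens -= 1
            if reliable:
                self.r_tokens -= 1
            sent.append(self.queue.popleft()[0])
        return sent


bucket = DualTokenBucket(rho_a=3, rho_r=1)
for name, reliable in [("r1", True), ("u1", False), ("r2", True),
                       ("u2", False), ("u3", False)]:
    bucket.enqueue(name, reliable)
print(bucket.tick())   # ['r1', 'u1']
print(bucket.tick())   # ['r2', 'u2', 'u3']
```

In the first tick an A-token is still available after u1 is sent, yet u2 and u3 stay queued behind r2 until the next R-token arrives, which is the sequencing property described above.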



C. Priority-based Packet Dropping 

In Section III-B, we have identified that a priority-based
packet drop mechanism is one of the major components to
achieve the goals of the MHPF service model. Essentially 
the priority dropping mechanism ensures that when the in- 
coming data rate to the packet queue is greater than the 
outgoing data rate, only the highest priority data packets 
that are sustainable by the outgoing rate get forwarded by 
preferentially dropping low priority packets. For simplic- 
ity, we assume that each tunnel source maintains separate 
priority-drop queues for individual sessions. We will relax 
this assumption in the next section. 

Variables for session i
  b_i  // the number of packets in the buffer for session i
  B_i  // the buffer bound for the session i
For incoming packet p
  pri ← p.priority
  if (b_i = B_i)  // if the queue is full, then perform priority dropping
    p' ← find_lowest_priority_packet(i)
    if (p'.priority < pri)
      drop(p'), then enqueue(p)
    else
      drop(p)
  else
    enqueue(p)

Fig. 6. Pseudo code of the priority drop algorithm 

The algorithm for priority dropping is quite straightforward
(Figure 6). When the queue receives a new packet
p, if the queue is full, then it searches for the lowest pri- 
ority packet already in the queue. Then it compares the 
priority of the selected packet, say p', with the new packet 
p. If the priority of p is higher than that of p', then p' is 
dropped from the queue and p is enqueued. Otherwise, p 
is dropped. 
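
The drop decision can be sketched in Python as follows. The class name PriorityDropQueue and the method offer are assumptions made for illustration; the sketch also assumes that a larger number means a higher priority and that, among equally low packets, the oldest one is evicted, neither of which is specified in the text.

```python
class PriorityDropQueue:
    """Bounded FIFO that evicts its lowest-priority packet for a better arrival."""

    def __init__(self, bound):
        self.bound = bound    # buffer bound B_i
        self.packets = []     # FIFO order; each entry is (priority, payload)

    def offer(self, priority, payload):
        """Try to enqueue a packet; return the packet that was dropped, or None."""
        if len(self.packets) < self.bound:
            self.packets.append((priority, payload))
            return None
        if not self.packets:                  # zero-sized buffer: nothing to evict
            return (priority, payload)
        # Queue is full: locate the lowest-priority packet already queued.
        idx = min(range(len(self.packets)), key=lambda k: self.packets[k][0])
        victim = self.packets[idx]
        if victim[0] < priority:
            del self.packets[idx]             # drop p', then enqueue p
            self.packets.append((priority, payload))
            return victim
        return (priority, payload)            # arrival is no better: drop p


queue = PriorityDropQueue(bound=2)
queue.offer(1, "low")
queue.offer(2, "medium")
print(queue.offer(3, "high"))   # (1, 'low') is evicted
print(queue.offer(1, "late"))   # (1, 'late') is refused
```

The linear scan for the victim is adequate for a sketch; with only a handful of priority levels, one FIFO per level would make the lookup constant time.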

Priority dropping involves fairly simple operations, and
we typically anticipate only 4 priority levels in practice.
However, managing a per-session priority-drop queue,
like per-session rate control, incurs excessive overhead
at the MHPF servers. Hence we
need to implement a smart session aggregation mechanism
in order to alleviate the computational complexity at the
MHPF servers.

D. Practical Issues

In this paper, we have thus far described all the
mechanisms with respect to a single multicast session. Although
we need to maintain state information and perform
feedback aggregation on a per-session basis, it will become
impractical for MHPF servers to provide rate adaptation and
priority dropping for each multicast session as the number
of on-going multicast sessions increases. We now briefly
outline a computationally simple approximation that
allows the tunnel sources of MHPF servers to perform rate
control and priority dropping for each tunnel rather than for
each session.



The basic idea of session aggregation is the following:
when there are multiple on-going sessions on a single
tunnel, the tunnel source multiplexes them in a single output
packet queue. In this case, the transmission rate on the
tunnel is set to the sum of the data rates of all the
sessions sharing the queue, i.e. $\rho_{tunnel} = \sum_i \rho_i$, $\forall i \in$
active sessions. Likewise, the buffer bound of the queue is
set to the sum of the buffer bounds of all the sessions, i.e.
$B_{tunnel} = \sum_i B_i$, $\forall i \in$ active sessions. We then adopt
the dynamic queue management mechanism [19], which
provides loose bandwidth assurances to each flow without
performing per-flow weighted fair queueing.

We now briefly outline the dynamic queue management
mechanism. Each flow has a buffer bound in a shared
queue. A flow cannot enqueue a packet into the shared
queue if both of the following conditions hold: (a) the
aggregate queue size has reached the aggregated bound,
i.e. $\sum_i b_i = B_{tunnel}$, and (b) the flow's queue size has
reached its own buffer bound, i.e. $b_i = B_i$.
Condition (a) allows for multiplexing the shared queue while
condition (b) provides a minimum buffer allocation in the
shared queue to each flow. It has been shown that this simple
mechanism enables efficient multiplexing with minimal
state management while providing long-term
bandwidth assurance for each flow. In our context,
we approximate the rate adaptation mechanism simply by
changing the buffer bound for the flow in the shared FIFO
priority-dropping queue, and updating the aggregate
sending rate on the tunnel.
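
One way to read the admission rule is that a packet is refused only when the aggregate queue has reached its bound and the flow has also reached its own bound. The sketch below encodes that reading; the function name can_enqueue and the dictionary layout are assumptions made here, and the sketch covers only the admission test, not what the queue does with a packet admitted while the aggregate is full.

```python
def can_enqueue(flow, queue_len, bound):
    """Admission test for a shared queue.

    queue_len maps each flow id to its current occupancy b_i,
    bound maps each flow id to its buffer bound B_i.
    """
    aggregate_full = sum(queue_len.values()) >= sum(bound.values())
    flow_at_bound = queue_len[flow] >= bound[flow]
    return not (aggregate_full and flow_at_bound)


bound = {"a": 2, "b": 2}
print(can_enqueue("a", {"a": 3, "b": 0}, bound))   # True: shared space is free
print(can_enqueue("a", {"a": 3, "b": 1}, bound))   # False: full, and a is over its bound
print(can_enqueue("b", {"a": 3, "b": 1}, bound))   # True: b is below its own bound
```

The first case shows the multiplexing effect of condition (a), where flow a borrows unused space, and the last case shows the minimum allocation that condition (b) reserves for flow b.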

Now we present the simulation-based performance re- 
sults of MHPF to illustrate how our design achieves the 
desired service model. 

V. Performance Results 

We have instantiated the MHPF architecture in a labora- 
tory testbed. The HPF protocol and the MHPF server func- 
tionality are implemented in the Linux 2.0.x kernel (Figure 
7). 

In addition to the implementation, we have been test- 
ing the performance of MHPF in comparison with other 
multicast protocols using the ns simulator. While the testbed
environment allows us to evaluate the protocol in a real 
network, the network configuration is limited in size. On 
the other hand, the simulation environment allows us to 
compare the performance of MHPF with other multicast 
protocols in a more controlled environment with various 
network configurations. In all our simulations, we have co- 
located MHPF servers with multicast routers for simplicity 
of presentation. In the rest of the section, we use the terms
"multicast router" and "MHPF server" interchangeably.

In this section, we present a set of simulation-based test
results of MHPF in comparison with RLM (receiver-driven
layered multicast). We chose RLM as a reference system
since it is the most well-known layered multicasting
approach to date. Performance comparison with the later
enhancements to RLM [8], [9], [10] is ongoing work.

Fig. 7. The structure of the MHPF server implementation: on
the forward path, packets are enqueued in the priority-drop
queue of the custom traffic shaper associated with the
outgoing tunnel. On the reverse path, the feedback is captured
by the BSD packet filter and pushed up to the state manager
which performs feedback aggregation and rate adaptation.
The original feedback is denied by firewalls and the
aggregated feedback is forwarded via raw socket.

We first consider a simple network topology shown in 
Figure 8. The available bandwidth is annotated on each 
link, and the one-way link latency was set to 10 msec 
for all the links. The operation parameters of RLM were 
adopted from [3]. We used a 2.2 Mbps synthetic
heterogeneous CBR flow with three priority levels with the ratio
of high:medium:low at 1:2:3. In case of RLM, the sender
transmits three individual CBR flows with 367 Kbps, 733
Kbps, and 1.1 Mbps for high, medium, low priorities. We
measured the performance of each protocol for 100
seconds.
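
The three RLM layer rates follow from splitting the 2.2 Mbps stream in the 1:2:3 ratio. The short check below reproduces that arithmetic; the variable names are chosen here purely for illustration.

```python
total_kbps = 2200                        # the 2.2 Mbps heterogeneous stream
ratio = {"high": 1, "medium": 2, "low": 3}

parts = sum(ratio.values())              # 6 equal shares in total
rates = {level: total_kbps * share / parts for level, share in ratio.items()}

for level, kbps in rates.items():
    print(f"{level}: {kbps:.0f} Kbps")   # high: 367, medium: 733, low: 1100
```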
Fig. 8. A simple two-level topology

We first tested three cases: (a) steady state performance 
with no congestion, (b) adaptation to a long-term conges- 
tion, and (c) adaptation to a series of short-term conges- 
tions. We then present the effect of partial reliability, the 
session aggregation, and the performance in a larger net- 
work in the following tests. We omit other results, e.g. 
multiple congestion case, coexistence with TCP, due to the 
space constraints. Table I summarizes the results of the 
first five tests. 

• Test 1. Steady State Performance In the first test, we
compare the adaptation of each protocol to the
heterogeneity of the network. The rate adaptation implementation of
MHPF closely simulates the behavior of the fluid model,
TABLE I

Effective throughput performance of MHPF and RLM at each receiver: The first four columns show the performance of MHPF,
the next four columns show that of RLM, and the last four columns show the relative performance of MHPF over RLM.



thus we observe its LIMD-based rate adaptation achieves
about 75 % of the path bandwidth for all the receivers. In
practice, due to the buffering in network routers, we
expect the bandwidth utilization of MHPF to be higher than
75 %. In case of RLM, we observed variable performance:
more than 80 % bandwidth utilization for R1 and R3, but
only around 50 % for R2. Overall, we find both protocols
achieve the goal of preferential delivery of high priority
packets.

• Test 2. Adaptation to Long-term Congestion For this test,
we instantiated a 1 Mbps CBR flow on the link M3 → R4
during the 30 - 50 second period. In the ideal case, we expect that
only R4 sees a temporary performance degradation, which
is true for both protocols. However, there is a difference in
the degree of degradation. In MHPF case, the throughput
decrease is only around 9 % whereas the decrease of RLM
is more than 42 % compared with Test 1.

• Test 3. Series of Short-term Congestions Now we
compare the performance when there is a series of short-term
congestions. On the link M3 → R4, we instantiated 2
Mbps CBR flows during the 30 - 31, 40 - 41, 50 - 51, and 60
- 61 second time windows. As in the previous case, we
expect only R4 sees performance degradation. This holds
for MHPF. However, in RLM, there is performance
degradation on R1, but we do not have an explanation for this
phenomenon. Overall, the performance degradation of R4 in
MHPF case is marginal (around 5 %) compared with Test
1. However, with RLM, R4 sees around 40 % decrease
even with short-term congestions.

Essentially the performance of MHPF in this case is in
between that of Test 1 and Test 2, whereas the performance
of RLM degrades to that of Test 2 (the long-term
congestion case). The poor performance of RLM in Tests 2 and 3
may be due to the coarse time grain rate adaptation based
on join/leave operations and the exponential timer backoff
after failed join experiments, which potentially imposes a
long delay before rejoining the layer after the congestion
has cleared. From this experiment, we can deduce that
the performance of RLM will degrade further as the
level of network dynamics increases.
• Test 4. Partial Reliability In this test, we show the
impact of the partial reliability semantics (with receiver
synchronization) of MHPF. According to the MHPF service
model, if the slowest receiver in the multicast group is
temporarily slower than the reliable data rate, then the entire
multicast session will slow down in order to ensure the
delivery of the reliable data to the slowest receiver. We set
the high priority packets to be reliable and start a
congestion on link M2 → R1, which reduces the bandwidth by
0.7 Mbps for 5 seconds (and causes a small number of high
priority packet losses). As a result, we observe that not only R1
but also all the other receivers experience the throughput
degradation. However, note that MHPF is capable of
terminating connection to excessively slow receivers in order
to maintain the desired session progression of the group.
• Test 5. Effect of Aggregation We now briefly illustrate
the effect of the aggregation using a simple scenario with
two multicast sessions sharing the same multicast tunnel.
As shown in Figure 8, we instantiated another multicast
session going through M1 and M3 with the influx rate of
5 Mbps at M1. We assigned the same buffer bound to
both multicast sessions, and thus expect to observe
approximately the same throughput for both sessions. The
effective throughput of the original multicast session was
measured as 1.389 Mbps, and the new session was 1.289 Mbps.
In other words, the approximation achieved by the session
aggregation mechanism is quite effective (less than 4 %
difference in the overall throughput) at least in this
simple configuration. We have more results illustrating the
effectiveness of dynamic queue management mechanism
but we do not present them due to the space constraints.
• Test 6. Performance in a Larger Network Topology 




Fig. 9. An eight receiver configuration 



In this test, we briefly present the performance of MHPF
and RLM in a larger network topology shown in Figure 9.
The effective data rate (Mbps) at each receiver in a steady
state is summarized in Table II. The major observation in
this case is that the performance of RLM has decreased for
all the receivers and the bandwidth utilizations are only 50
- 70 %. This must result from the uncoordinated
join/leave operations of the individual receivers since now
there are more receivers sharing the same multicast router.
On the other hand, we observe that the performance of
MHPF is essentially the same compared with the previous
test cases. On the average, MHPF still achieves around 70
- 75 % of the available bandwidth.

We ran the same test in a larger multicast group
of 16 receivers organized in a 4-level binary tree, and
observed essentially the same trend: as the number of
receivers increases, the performance of RLM degrades
whereas the performance of MHPF remains almost the
same.

There can be several reasons for the progressively poor 
performance of RLM with the increase of the number of 
receivers and the network dynamics: (a) coarse grain adap- 
tation, (b) interference of join/leave operations of indi- 
vidual receivers, and (c) incorrect congestion information 
propagated and maintained by the shared learning mecha- 
nism. By design choice, MHPF does not suffer from any 
of these problems. In summary, we observed that MHPF 
better utilizes the network bandwidth, effectively adapts 
to the network dynamics, and scales better in terms of the 



multicast group size compared to the state of the art lay- 
ered multicast approach RLM. 

VI. Summary and Review of Trade-offs 

In this paper, we have presented a new approach for sup- 
porting multicast communications for the multimedia ap- 
plications in the Internet environment. Unlike most related 
work, wherein layering mechanism is popularly used, we 
have proposed a multicast architecture where the multime- 
dia data stream is processed as a single heterogeneous data 
stream and appropriate filtering mechanisms in the MHPF 
servers ensure the reception of highest priority portion of 
the data stream that can be sustained on each path of the 
multicast tree. 

Having presented the design and performance of MHPF 
in previous sections, we now discuss the trade-offs in 
adopting the MHPF approach as opposed to conventional 
layering. 

• Granularity of adaptation There are two dimensions to
the granularity of adaptation: how long does it take to react 
to the network dynamics, and how fine is the adaptation in 
terms of bandwidth. The strength of MHPF is that it is 
highly adaptive along both dimensions. 
In layering, a receiver determines when to join or leave 
multicast layers based on congestion estimation. Estimat- 
ing congestion takes time; furthermore, joining/leaving 
layers is accomplished through IGMP, which is not
optimized for reducing latencies. Consequently, layering
approaches take longer to recover from short-term band- 
width fluctuations. In MHPF, rate adaptation takes place 
through a backpressure mechanism that takes effect within 
one epoch. Thus, MHPF is able to quickly recover from 
short-term rate fluctuations. Furthermore, because of the 
priority dropping mechanism, higher priority packets are 
not affected during adaptation of lower priority compo- 
nents unlike layering. 

An even more serious point is the granularity of the
bandwidth of rate adaptation. In layering, the sender
predetermines the granularity of rate adaptation by the
number of layers into which it decomposes its heterogeneous
data stream. Of course, each layer requires an independent 
multicast address - thus the sender is constrained to keep 
the decomposition coarse grain. On the other hand, each 
tunnel in MHPF is abstracted as a token bucket with pri- 
ority drop queue - the efflux rate of the leaky bucket can 
be adjusted in a fine grain. In summary, MHPF adapts rate 
at the granularity of a packet, while layering approaches 
adapt rate at the granularity of a layer. 

• Choosing which component of the heterogeneous stream
to receive In MHPF, the sender determines the relative pri- 
ority levels between the different interleaved sub-streams 
of a heterogeneous data stream, and all receivers must ad- 



here to this notion of priority. A perfect application for this 
scheme is a multi-resolution video stream, where lower 
frequency or DC components have higher priority than 
higher frequency detail. In this case, all receivers have the 
same notion of priority, since it makes no sense receiving 
a lower priority packet at the expense of losing a higher 
priority packet. 

However, in a heterogeneous data stream that is composed 
of interleaved audio, video and text data sub-streams, some
receivers may prefer to receive audio with higher priority 
over video while other receivers may prefer video over au- 
dio. In this case, it is the receiver which must determine 
the priority of packets within a stream. MHPF is not well 
suited for this scenario, while layering allows the receivers 
the flexibility to choose their own layers. In summary, in 
MHPF the sender imposes the priority among data packets 
uniformly for all receivers, while in layering each receiver 
has the ability to establish its own priority scheme
independently. While the latter is certainly more flexible, we
believe that the MHPF model is adequate and well-suited 
for multi-resolution coding. 

• Deployment and Complexity We compare MHPF and
layering approaches in terms of computational complexity
in the network and the end hosts, state requirements in
the network, and ease of deployment.
In terms of computational complexity, layering moves the 
complexity within the network down to the routing layer 
(joins/leave operations), and to the end host (decomposi- 
tion/resynchronization of data stream). Most layering ap- 
proaches do not address the issue of reliable delivery. In 
MHPF, the complexity is spread out among the compo- 
nents - sender, receivers, and MHPF servers. The sender 
and receivers participate in per-session tasks such as flow 
control, partial reliability, and interleaving heterogeneous 
sub-streams. The MHPF server performs three tasks - rate 
adaptation and priority drop for each tunnel, and feedback 
aggregation for each session. 

The key issue is whether the improved service model of
MHPF is worth the computational complexity. For an optimized
implementation, feedback aggregation occurs once
every epoch (200 msec in our configuration) and requires
between 50 and 100 instructions depending on the control
path; priority dropping is per-tunnel and requires between
4 and 8 additional instructions (over tail dropping) for a priority
queue with 4 levels; and rate adaptation requires approximately
20 instructions per epoch per tunnel. As we
can see, the computational complexity for each of the components
is reasonable. The main source of overhead
comes from maintaining a traffic shaper for each tunnel
(though traffic shaping is now a part of the standard distribution
of most kernels, including our deployment platform
of Linux 2.0.x). We are working on a tunnel-level
window-based flow control scheme that eliminates the traffic
shaping requirement. In summary, the processing overhead
of MHPF is reasonable, and the fact that MHPF can
be implemented using per-tunnel rather than per-session
processing (except for reliability, which is an orthogonal
issue) enables MHPF to scale well.
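
As a rough back-of-the-envelope check on the figures above, the per-epoch control costs can be turned into an instruction rate. The sketch below assumes the upper bound of 100 instructions per session for feedback aggregation and about 20 instructions per tunnel for rate adaptation, each once per 200 msec epoch. The function name and the example session and tunnel counts are hypothetical, and the priority-drop cost is left out because the text does not say how often it is incurred.

```python
EPOCH_SECONDS = 0.2  # epoch length quoted in the text (200 msec)


def control_instructions_per_second(sessions, tunnels,
                                    aggregation_cost=100, adaptation_cost=20):
    """Estimate epoch-driven MHPF server instructions per second.

    aggregation_cost: instructions per session per epoch (text quotes 50 to 100).
    adaptation_cost: instructions per tunnel per epoch (text quotes about 20).
    """
    per_epoch = sessions * aggregation_cost + tunnels * adaptation_cost
    return per_epoch / EPOCH_SECONDS


# Hypothetical server carrying 10 sessions over 4 tunnels.
print(control_instructions_per_second(sessions=10, tunnels=4))
```

Under these assumed counts the epoch-driven work stays in the thousands of instructions per second, which is in line with the claim that the overhead is reasonable.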
In terms of deployment, layering has the immense advan- 
tage of not requiring any change to the existing multicast 
servers. On the other hand, MHPF requires installing up- 
graded and specialized multicast servers, and thus requires 
some infrastructure change for deployment. In the long 
run, we believe that it is worth the trade-off because of the 
improved service model. 

In conclusion, the issue of MHPF versus layering is re- 
ally an issue of filtering versus layering. MHPF promotes 
the idea of sending a single heterogeneous data stream, and 
filtering through the highest priority component of the het- 
erogeneous stream along each path of the multicast tree 
so that each receiver receives the most important data at 
the rate that its path can sustain. Additionally, MHPF 
provides mechanisms for reliable delivery of packets so 
marked by the sender, and supports end-to-end rate con- 
trol of the interleaved reliable and unreliable sub-streams. 
MHPF offers a superior service model and enhanced per- 
formance for multimedia multicast communications at the 
expense of requiring smarts within the network. Irrespec- 
tive of the practical issues of deployment, we believe that 
exploring this approach as a viable technical alternative to 
layering and understanding the trade-offs between the two 
approaches is important, and preliminary evaluations seem 
to indicate that MHPF is able to provide improved service 
at acceptable overhead.

Abstract

This report examines potential stability problems associated with a model-based
approach to TCP-friendly flow control for non-TCP traffic. Specifically,
such an approach involves using a TCP-friendly formula that estimates the
throughput of a TCP session with the same end-to-end traffic characteristics as
the non-TCP connection under consideration. The inputs to this formula include
the round trip time, the timeout value, and the packet loss fraction of the
connection. This paper shows that the estimated loss fraction per transmitted
packet depends strongly on the current transmission rate of the connection as
well as the actual loss fraction of the path. Thus the estimated loss fraction
can contain errors which result in inaccurate estimation of the corresponding
TCP throughput. This inaccuracy can push the transmission rate of the non-TCP
connection away from the fair share of the bottleneck bandwidth on the
end-to-end path, so that in steady state, the connection ends up receiving
either over-allocation or under-allocation of bandwidth.

1 Introduction 

Congestion control is an integral part of any best-effort Internet data transport pro- 
tocol. It is widely accepted that the congestion avoidance mechanisms employed in 
TCP [1] have been one of the key contributors to the success of the Internet. A 
conforming TCP flow is expected to respond to congestion indication (e.g., packet
loss) by drastically reducing its transmission rate and by slowly increasing its rate
during steady state. This congestion control mechanism encourages the fair sharing 
of a congested link among multiple competing TCP flows. A data flow is said to be 
TCP-friendly if at steady state, it uses no more bandwidth than a conforming TCP 
connection running under comparable conditions.

Recently we have seen several efforts to develop a stochastic model of TCP congestion
control that gives a simple analytical formula for the throughput of a TCP
sender as a function of packet loss and round trip time (RTT) [2, 7, 3]. These efforts
are propelled by the interest in using the formula for TCP-friendly flow control of a
non-TCP flow such as UDP traffic [6, 8] and reliable multicast [5]. Typically, these
flow control schemes work as follows. A receiver monitors packet loss rates and round
trip delays, and using a TCP-friendly formula, it estimates the throughput of a TCP
connection running under the same operating conditions. The estimated throughput
is sent as feedback to the sender. If the feedback throughput is less than or equal to
the current transmission rate of the non-TCP flow, then the sender sets its rate to
the feedback throughput. Otherwise, it increases its rate.

The stochastic model used by Padhye et al. [3] makes the TCP throughput estimation
based on the following assumptions:

o When a packet is lost, all subsequent packets in the same RTT round are lost.

o The probability that a packet is lost in an RTT round, given that no previous
packet in the same round is lost, is independent of packet loss in earlier rounds.
Call this probability $l_{act}$.

To measure $l_{act}$ of a TCP flow under observation, Padhye et al. [3] count the
number of TCP loss indications (triple duplicate acknowledgements and timeouts) over
a certain period, and divide the result by the total number of packets transmitted by
TCP over that duration. This is an approximation. Let $l_{app}$ be the expected value of
the resulting value. $l_{app}$ is used as input to their formula to estimate the throughput
of a TCP connection under observation.

When a non-TCP flow is regulated using the formula, it is not possible to compute
$l_{app}$ since the flow may use a different window size. Instead, Handley et al. [5] and
Padhye et al. [4] estimate $l_{act}$ on the non-TCP flow by dividing the number of loss
events by the total number of packets transmitted. Loss events are registered as
follows. The first packet loss is counted as a loss event. Following this, there is a
back-off for the duration of an RTT during which no packet loss is counted. The next
packet loss after this back-off is counted as a loss event, followed by another RTT
back-off, and so forth. Let the expected value of this estimate be $l_{est}$.
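
The loss-event rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: losses are given as timestamps in seconds, the RTT is taken as a fixed value, and the function names are hypothetical rather than taken from [4] or [5].

```python
def count_loss_events(loss_times, rtt):
    """Count loss events: a loss opens an event, and further losses
    within one RTT of that loss are ignored (the back-off period)."""
    events = 0
    backoff_until = float("-inf")
    for t in sorted(loss_times):
        if t >= backoff_until:
            events += 1
            backoff_until = t + rtt
    return events


def estimate_loss_fraction(loss_times, packets_sent, rtt):
    """Loss fraction estimate: loss events divided by packets transmitted."""
    if packets_sent <= 0:
        raise ValueError("packets_sent must be positive")
    return count_loss_events(loss_times, rtt) / packets_sent


# Six individual losses collapse into three loss events with a 0.1 s RTT.
losses = [0.0, 0.05, 0.09, 0.25, 0.30, 0.50]
print(count_loss_events(losses, rtt=0.1))
```

Because every loss inside a back-off window is discarded, the count depends on how many packets the flow sends per RTT, which is the rate dependence the report goes on to analyze.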

Let $B$ be the total bandwidth shared among $n$ TCP sessions and one non-TCP
flow running on the same end-to-end path. The objective of TCP-friendly algorithms
is to ensure that the rate $\mu$ of the non-TCP flow is maintained at the fair share
given by $B_{fair} = \frac{B}{n+1}$ in steady state. The known TCP-friendly algorithms achieve
this by a feedback mechanism. Specifically, an estimate of the throughput on each
of the $n$ TCP sessions is fed back to the sender, which then uses this to regulate its
transmitting rate $\mu$. Assuming that the TCP sessions share the residual bandwidth
equally, the steady-state throughput of each TCP session is $\frac{B-\mu}{n}$. Ideally, this is the
parameter which must be fed back to the sender. It can be seen that when $\mu > B_{fair}$,
$\frac{B-\mu}{n} < B_{fair}$, and when $\mu < B_{fair}$, $\frac{B-\mu}{n} > B_{fair}$.

However, as discussed earlier, the TCP throughput estimate is calculated based on 
the loss fraction of packets on the non-TCP flow. This obviously means that errors 
in estimating the loss fraction could lead to errors in estimating TCP throughput. 
In the following sections, we show that this indeed happens, and could even result in
unfair allocation of bandwidth. In particular, we show that:

1. When $\mu > \frac{1}{RTT}$ (i.e., more than one packet per RTT), $l_{est} < l_{act}$. As a result, by substituting
$l_{est}$ for $l_{act}$ in any formula which estimates TCP throughput based on
$l_{act}$, we obtain an erroneous estimate. This results in over-estimation of the
throughput.

2. When $\mu < B_{fair}$, $l_{est} > l_{app}$. Thus, substituting $l_{est}$ for $l_{app}$ in a formula estimating
TCP throughput based on $l_{app}$ gives us an erroneous estimate. This results
in under-estimation of the throughput.

3. When $\mu > B_{fair}$, we could have $l_{est} < l_{app}$ under certain conditions. Thus,
substituting $l_{est}$ for $l_{app}$ in a formula estimating TCP throughput based on $l_{app}$
gives us an erroneous estimate and over-estimation of TCP throughput.

4. The formula used to estimate TCP throughput, such as the one provided in [3],
may itself have some inaccuracy.



5. Depending on the starting rate of $\mu$, these errors introduced in TCP throughput
estimation push $\mu$ further away from the fair share, rather than forcing it to
converge to the fair share. Thus, at steady state, the non-TCP flow may end
up receiving either over-allocation or under-allocation of bandwidth.

In Section 2, we briefly outline the TCP-friendly flow control algorithm proposed
in [5], and the formula used in TCP throughput estimation. In Section 3, we describe
the sources of error in TCP throughput estimation, and show how these errors result
in unfair allocation of bandwidth by the flow control algorithm. Section 4 contains
numerical examples which substantiate these findings.

2 Model-based flow control

The first TCP throughput estimation model was proposed by Floyd [2] and Ott et
al. [7]. Their model gives the following formula:

$$B(l) = \frac{1.22\, s}{t_{RTT} \sqrt{l}} \qquad (2.1)$$

where $s$ is the packet size, $l$ is the loss fraction of the TCP packets, and $t_{RTT}$ is the
round trip delay of an end-to-end path where TCP runs.

$l$ is defined as the probability that a packet is lost given that no previous packet in
the same round is lost (we use $l_{act}$ and $l$ interchangeably). $l$ is estimated by counting
the number of loss events over a certain period and dividing the result by the number
of packets sent.

Padhye et al. [3] proposed another model which gives a better approximation
than the earlier one. The model uses the following equation¹:

(2.2)

where $t_0$ is the timeout value.

¹ This is only an approximation formula. The complete formula can be found in [3].

Let $\mu$ be the bit rate of transmission of packets over the non-TCP connection. Let
$B$ be the total bandwidth available between the source and a given destination. If
there are $n$ active TCP sessions between the source and the destination, the throughput
of each, in steady state, is given by $\frac{B-\mu}{n}$.

A typical model-based flow control works as follows. Using one of the above formulae,
and estimated values of $l$, $t_{RTT}$ and $t_0$, the receiver of the non-TCP flow
computes a value of $B(l)$ which is fed back to its sender. If $B(l) < \mu$, the sender sets
its transmission rate to $B(l)$. Otherwise, it increases its rate by some amount. The
receiver continues to report a new value of $B(l)$, and the sender adjusts
its rate accordingly. The idea is that if the feedback throughput value is accurate
enough, the transmission rate eventually converges to the fair share of bandwidth
$B_{fair}$.
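
The feedback loop just described can be sketched as follows. The first function assumes that (2.1) has the familiar form $1.22\,s / (t_{RTT}\sqrt{l})$ with $s$ the packet size. The second uses the widely quoted approximate form of the Padhye et al. model with one packet acknowledged per ACK; that form is an assumption here and may differ in detail from equation (2.2) of this report. The function names and the additive increase step are hypothetical.

```python
import math


def floyd_throughput(s, t_rtt, l):
    """Assumed form of (2.1): B(l) = 1.22 * s / (t_rtt * sqrt(l))."""
    return 1.22 * s / (t_rtt * math.sqrt(l))


def padhye_throughput(s, t_rtt, t0, l):
    """Commonly quoted approximation of the Padhye et al. model.

    Assumed form, not copied from this report; t0 is the timeout value.
    """
    denom = (t_rtt * math.sqrt(2.0 * l / 3.0)
             + t0 * min(1.0, 3.0 * math.sqrt(3.0 * l / 8.0))
             * l * (1.0 + 32.0 * l * l))
    return s / denom


def next_rate(mu, feedback, increment):
    """Sender rule: adopt the feedback if it is below the current rate,
    otherwise increase the rate by a (hypothetical) fixed increment."""
    if feedback < mu:
        return feedback
    return mu + increment


# One control step with illustrative values (bytes, seconds).
feedback = padhye_throughput(s=1000, t_rtt=0.1, t0=0.4, l=0.01)
print(next_rate(mu=150000.0, feedback=feedback, increment=5000.0))
```

Both formulas decrease as the loss fraction grows, so any downward bias in the measured loss fraction translates directly into a higher feedback rate.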

3 Errors in TCP Throughput Estimation and their 
Impact on Resource Allocation 

In this section, we look at the different sources that cause errors in estimating the 
equivalent TCP throughput and show that their potential consequence is unfair allocation
of resources to the non-TCP flow.

3.1 Error in Loss Fraction Estimation 

Let $l_{act}$ be the actual loss fraction, and let $l_{est}$ be the expected value of the estimate
measured on the non-TCP connection. Let $l_{app}$ be the expected value of an estimate
measured on one of the TCP connections. Then, we have

THEOREM 1

$$l_{est} > l_{app} \quad \text{if} \quad \mu < \frac{B - \mu}{n}$$

For proof, we compare the packets sent over the non-TCP stream over the measurement
period $T$ to those sent over a TCP session. Let $T$ be equal to $N$ round trip
times. Let $W_i$, $i = 1, 2, \ldots, N$, be the number of packets sent in the $i$-th round over the
TCP connection under observation. Therefore, the number of packets sent over the
TCP session is given by $\sum_{i=1}^{N} W_i$. The expected number of loss events is given by:

$$LOSS_{TCP} = \sum_{i=1}^{N} \sum_{k=1}^{W_i} (1 - l_{act})^{k-1}\, l_{act}$$

Therefore, we have:

$$l_{app} = \frac{LOSS_{TCP}}{\sum_{i=1}^{N} W_i}$$

In the same duration, $\mu T$ packets are transmitted over the non-TCP session. The
expected number of loss events observed here is given by:

$$\sum_{i=1}^{N} \sum_{k=1}^{T\mu/N} (1 - l_{act})^{k-1}\, l_{act}$$

Thus, we have:

$$l_{est} = \frac{1}{T\mu} \sum_{i=1}^{N} \sum_{k=1}^{T\mu/N} (1 - l_{act})^{k-1}\, l_{act} \qquad (3.4)$$

Since $\mu < \frac{B-\mu}{n}$ and TCP is in steady state (congestion avoidance mode), there are
more packets transmitted on any TCP connection than on the non-TCP connection.
Hence, $\sum_{i=1}^{N} W_i > T\mu$.

LEMMA 1 For any $q$ such that $0 < q < 1$, $F_m(q) = \frac{1 + q + q^2 + \cdots + q^{m-1}}{m}$ is a monotonically
decreasing function of $m$.

PROOF: We have $0 < q < 1$.

Hence, $q^m < q^{m-1} < q^{m-2} < \cdots < q^2 < q < 1$.

$F_{m+1}(q)$ can be rewritten as $\frac{1 + q + \cdots + q^{m-1} + q^m}{m+1}$.

Thus, we have:

$$F_{m+1}(q) < \frac{1 + \frac{1}{m} + q + \frac{q}{m} + \cdots + q^{m-1} + \frac{q^{m-1}}{m}}{m+1} = \frac{1 + q + \cdots + q^{m-1}}{m} = F_m(q)$$



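LEMMA 2 If $G_m(q) = F_m(q) \times m = 1 + q + q^2 + \cdots + q^{m-1}$ and $M = \frac{\sum_{i=1}^{N} m_i}{N}$, then

$$\sum_{i=1}^{N} G_{m_i}(q) \le N \times G_M(q)$$

PROOF: Consider the sequence $\{q^{m_1}, q^{m_2}, \ldots, q^{m_N}\}$. Its arithmetic mean is given
by $AM(q) = \frac{\sum_{i=1}^{N} q^{m_i}}{N}$. Its geometric mean is $GM(q) = q^M$, where $M = \frac{\sum_{i=1}^{N} m_i}{N}$.
$AM(q) \ge GM(q)$. Also, $0 < q < 1$, so $(1 - q) > 0$. Therefore, $\frac{N(1 - AM(q))}{1 - q} \le \frac{N(1 - GM(q))}{1 - q}$.

But $\frac{N(1 - AM(q))}{1 - q} = \sum_{i=1}^{N} G_{m_i}(q)$, and $\frac{N(1 - GM(q))}{1 - q} = N \times G_M(q)$.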

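We have:

$$l_{app} = \frac{l_{act} \sum_{i=1}^{N} G_{W_i}(1 - l_{act})}{\sum_{i=1}^{N} W_i}$$

Therefore, from Lemma 2, it is seen that:

$$l_{app} \le \frac{N \times G_{EW}(1 - l_{act}) \times l_{act}}{N \times EW} \qquad (3.5)$$

But $N \times EW = \sum_{i=1}^{N} W_i > T\mu$. Hence, from Lemma 1, we have:

$$\frac{N \times G_{EW}(1 - l_{act}) \times l_{act}}{N \times EW} < \frac{N \times G_{T\mu/N}(1 - l_{act}) \times l_{act}}{T\mu} \qquad (3.6)$$

Recognize that the RHS in (3.6) is $l_{est}$. Thus, combining (3.6) and (3.5) gives us the
proof of Theorem 1.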



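THEOREM 2

$$l_{est} < l_{app} \quad \text{if} \quad \mu > \frac{B - \mu}{n}, \text{ and } \frac{T\mu}{N} > W_i \text{ for } 1 \le i \le N$$

From Lemma 1, it follows that $\frac{G_m(q)}{m} < \frac{G_k(q)}{k}$ for $k < m$. Therefore, $k \times G_m(q) <
m \times G_k(q)$. Observe that $l_{est}$ may be rewritten as:

$$l_{est} = \frac{\sum_{i=1}^{N} l_{act} \times G_{T\mu/N}(1 - l_{act})}{T\mu} = l_{act} \times N \times \frac{G_{T\mu/N}(1 - l_{act})}{T\mu}$$

Therefore, $l_{est} \times \frac{T\mu}{N} = l_{act} \times G_{T\mu/N}(1 - l_{act})$. But $\frac{T\mu}{N} > W_i$ for $1 \le i \le N$.
Hence, for $1 \le i \le N$, we have:

$$G_{T\mu/N}(1 - l_{act}) \times W_i < G_{W_i}(1 - l_{act}) \times \frac{T\mu}{N}$$

Taking the summation over $1 \le i \le N$, we get:

$$G_{T\mu/N}(1 - l_{act}) \sum_{i=1}^{N} W_i < \frac{T\mu}{N} \sum_{i=1}^{N} G_{W_i}(1 - l_{act})$$

Simplifying both sides, we get:

$$\frac{l_{act} \times N \times G_{T\mu/N}(1 - l_{act})}{T\mu} < \frac{l_{act} \sum_{i=1}^{N} G_{W_i}(1 - l_{act})}{\sum_{i=1}^{N} W_i}$$

Or, $l_{est} < l_{app}$.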

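THEOREM 3 When $\mu > \frac{1}{RTT}$, $l_{est} < l_{act}$.

For proof, we revisit Lemma 1. Observe that:

$$l_{est} = \frac{\sum_{i=1}^{N} l_{act} \times G_{T\mu/N}(1 - l_{act})}{T\mu} = l_{act} \times \frac{G_{T\mu/N}(1 - l_{act})}{T\mu/N} = l_{act} \times F_{T\mu/N}(1 - l_{act})$$

But $\frac{T}{N} = RTT$, and $\mu \times RTT > 1$. Therefore, from Lemma 1, it follows that
$F_{T\mu/N}(1 - l_{act}) < F_1(1 - l_{act}) = 1$.
Hence, $l_{est} < l_{act}$.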
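
The effect stated in Theorem 3 is easy to check numerically. The sketch below assumes the closed form $l_{est} = l_{act} \times F_m(1 - l_{act})$ with $m = \mu \times RTT$ packets per round trip, and uses the geometric-sum identity for $F_m$; the function names and the rates tried are hypothetical.

```python
def f(m, q):
    """F_m(q) = (1 + q + ... + q**(m-1)) / m, via the closed form of the sum."""
    return (1.0 - q ** m) / (m * (1.0 - q))


def estimated_loss_fraction(l_act, mu, rtt):
    """l_est = l_act * F_m(1 - l_act), with m = mu * rtt packets per RTT."""
    m = mu * rtt
    return l_act * f(m, 1.0 - l_act)


# Hypothetical rates in packets per second, with RTT = 0.1 s and l_act = 0.02.
for mu in (10, 50, 100, 400):
    print(mu, estimated_loss_fraction(0.02, mu, rtt=0.1))
```

At one packet per RTT the estimate equals the actual loss fraction, and it falls steadily as the sending rate rises, which matches the monotonicity in Lemma 1.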

3.2 Impact of Errors in Loss Fraction on TCP Throughput
Estimation

First, consider the case where TCP throughput is estimated as a function of $l_{act}$, such
as (2.2). From (2.2), it can be seen that the estimate $B(l)$ is a decreasing function
of $l$. Thus, when $l_{est} < l_{act}$, the estimate $B(l_{est})$ is likely to be larger than the actual
value of $\frac{B-\mu}{n}$, which is the correct feedback parameter. In the extreme case, we may
have the following situation:

$$B_{fair} < \mu < B_{feed} = B(l_{est})$$

Consider the case where the non-TCP transmission rate $\mu$ is given by $\frac{B}{n+1} + n\Delta B$,
for some $\Delta B > 0$. Now, suppose we are given a function $B(l)$ which correctly
estimates the throughput on a TCP connection given the value of the loss fraction
of packets, $l$. Let $\beta(l) = -\frac{dB(l)}{dl} > 0$, for all $l$ such that $0 < l < 1$. For the value of $\mu$
considered, $B(l) = \frac{B-\mu}{n}$ since we have assumed that $B(l)$ correctly estimates TCP
throughput. Hence,

$$B(l) = \frac{B}{n+1} - \Delta B$$

Thus, when the value of $\mu$ is greater than the fair share, $\frac{B}{n+1}$, the feedback parameter
is less than the fair share. This facilitates reduction of the transmission rate on the
non-TCP connection to preserve bandwidth fairness.

However, when the feedback parameter is calculated based on $l_{est}$, the estimate of
$l$, rather than the actual value $l_{act}$, there is no such guarantee. As shown earlier,

$$l_{est} = \frac{N \left[1 - (1 - l_{act})^{T\mu/N}\right]}{T\mu}$$

where $T$ is the period of observation and $N$ is the number of round trip times in $T$.
Therefore, $l_{est} = l_{act} - \Delta l < l_{act}$. The feedback parameter $B(l_{est}) = B(l_{act} - \Delta l) \approx
\frac{B}{n+1} - \Delta B + \beta(l_{act})\Delta l$. If $\Delta B < \frac{\beta(l_{act})\Delta l}{n+1}$, then we have:

$$\frac{B}{n+1} < \mu < B(l_{est})$$

Next, we consider the case where TCP throughput is calculated using a formula
based on $l_{app}$. Assuming we are provided a formula that accurately estimates
TCP throughput, we should have:

$$B(l_{app}) = \frac{B - \mu}{n}$$

Let $\mu = \frac{B}{n+1} + n\Delta B$. But since we use $l_{est}$ instead of $l_{app}$, the feedback parameter is
given by:

$$B(l_{est}) \approx \frac{B}{n+1} - \Delta B + \beta(l_{app})(l_{app} - l_{est})$$

Thus, the following problems could be encountered:

1. When $\mu < B_{fair}$, $l_{est} > l_{app}$. There could be a situation where $B(l_{est}) < \mu < B_{fair}$.

2. When $\mu > B_{fair}$ and $l_{est} < l_{app}$, there could be a situation where $B(l_{est}) > \mu > B_{fair}$.

3.3 Errors in TCP Throughput Model 

In the earlier section, we considered the error due to inaccurate estimation of the loss
fraction, assuming the throughput model to be perfect. However, it is seen that
the throughput estimation based on (2.2) also introduces some error (see [3]). From
the tables provided in [3], it was observed that the relative error approached 100% in
certain extreme cases. In this section, we show that this error can add on to the error
in loss fraction estimation to result in erroneous feedback to the non-TCP sender.

Suppose the model estimates TCP's throughput to be $B_{mod}(l) = B_{act} + B_e$, where
$B_{act}$ is the actual TCP throughput.

Consider a link between two nodes with a minimum round trip time of $T_1$ seconds.
Let $N_1 = \lfloor B_{act} \times T_1 \rfloor$. Let $l_1 = l - \Delta l = l \times \frac{1 - (1 - l)^{N_1}}{N_1 l}$. It is easily seen that $\Delta l > 0$
if $N_1 > 1$. Let the bandwidth of the link under consideration be $B_{tot} = nB_{act} +
\frac{B_{mod}(l) + B_{mod}(l - \Delta l)}{2}$. Now suppose we have a flow of rate $\mu = \frac{B_{mod}(l) + B_{mod}(l - \Delta l)}{2}$ sharing
this link with $n$, $n \ge 1$, TCP connections. The throughput of each TCP session is
$\frac{B_{tot} - \mu}{n} = B_{act}$. Using (2.2), however, we get different results. If $l$ is estimated correctly,
the TCP bandwidth is estimated as:

$$B_{mod}(l) = B_{act} + B_e$$

Since $\Delta l > 0$, and $B_{mod}(l)$ is a decreasing function of $l$, $B_{mod}(l - \Delta l) > B_{mod}(l)$. So,

$$\mu - B_{mod}(l) = \frac{B_{mod}(l - \Delta l) - B_{mod}(l)}{2} > 0$$

In other words, $\mu > B_{act} + B_e$.

But $l$ is not correctly estimated. The estimate of $l$ is given by:

$$l_{est} = \frac{1 - (1 - l)^{\mu \times RTT}}{\mu \times RTT}$$

where $RTT$ is the mean round-trip time. Obviously, $RTT \ge T_1$. Besides, $\mu >
B_{act} + B_e$. Hence, from Lemma 1, it can be seen that $l_{est} < l_1 < l$. We can write
$l_{est} = l - Dl$ where $Dl > \Delta l$.

The estimate of the TCP bandwidth, using (2.2), is given by:

$$B_{mod}(l_{est}) = B_{mod}(l - Dl)$$

As mentioned earlier, $B_{mod}(l)$ is a decreasing function. Therefore,

$$B_{mod}(l_{est}) = B_{mod}(l - Dl) > B_{mod}(l - \Delta l)$$

Also,

$$B_{mod}(l_{est}) - \mu > B_{mod}(l - \Delta l) - \mu = \frac{B_{mod}(l - \Delta l) - B_{mod}(l)}{2} > 0$$

The fair share is given by:

$$B_{fair} = \frac{nB_{act} + \mu}{n+1}$$

Thus, we have a situation where $B_{mod}(l_{est}) > \mu > B_{mod}(l) > B_{fair}$, i.e., one where the
error in estimating $l$, coupled with the error introduced by the formula (2.2), results
in erroneous throughput estimation. In the next section, we show how this may even
result in unfair allocation of resources.



3.4 Non-Convergence to Fair-Share 

In the earlier sections, we have shown that the errors in estimating the loss fraction
and errors in the TCP throughput formula could result in the following situations:

(i) $B_{feed} > \mu > B_{fair}$ OR (ii) $B_{feed} < \mu < B_{fair}$

where $\mu$ is the rate of the non-TCP flow, $B_{fair}$ is the fair share, and $B_{feed} = B(l_{est})$
is the feedback parameter.

In this section, we show that when $B_{feed} >$ (resp. $<$) $\mu >$ (resp. $<$) $B_{fair}$, the non-TCP
rate never converges to the fair share in steady state. In addition, the long-term
average throughput of the non-TCP flow does not converge to the fair share either.

We prove this for the case where $B_{feed} > \mu > B_{fair}$; the proof for the case
$B_{feed} < \mu < B_{fair}$ is similar. We first observe that $B_{feed} = B(l_{est})$, where $l_{est}$ is
itself a function of $\mu$. Therefore, we can write $B_{feed} = B^n(\mu)$. Assuming $B(l_{est})$ is a
continuous function, and $l_{est}$ is a continuous function of $\mu$, $B^n(\mu)$ is also continuous.
We know that $B^n(\mu) > \mu$. Hence, we have either:

$$B^n(\nu) = \nu \text{ for some } \nu > \mu \qquad (3.8)$$

Or:

$$B^n(\nu) > \nu \text{ for all } \nu > \mu \qquad (3.9)$$

If (3.8) is true, then the feedback parameter when the initial rate of the non-TCP
flow is $\nu$ is given by $B^n(\nu) = \nu$. Therefore, the rate as determined by the non-TCP
flow control protocol converges to $\nu > \mu > B_{fair}$. The long-term average of the flow
is also $\nu$.

If (3.9) is true, when the initial rate $\nu_0$ is greater than $\mu$, it can be seen that the
feedback parameter $B^n(\nu_0)$ is greater than $\nu_0$. Writing $\nu_1 = B^n(\nu_0)$, and in general,
$\nu_{i+1} = B^n(\nu_i)$ for $i \ge 0$, we have:

$$B^n(\nu_i) > \nu_i > \nu_0$$

In practice, $\nu_{i+1} = \min(B^n(\nu_i), B_{max})$, but since $\nu_i \le B_{max}$, we can still write:

$$\nu_{i+1} \ge \nu_i \ge \nu_0$$

This means $\nu_i > \mu > B_{fair}$, for all $i \ge 0$. The long-term average of the flow is given
by:

$$\nu_\infty = \lim_{k \to \infty} \frac{\sum_{i=1}^{k} \nu_i}{k} > \lim_{k \to \infty} \frac{k\mu}{k} = \mu > B_{fair}$$
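
The second case of the argument can be illustrated with a toy iteration. The feedback function below is purely hypothetical: it stands in for a $B^n(\nu)$ that always reports 10% more than the current rate, which is one simple way to satisfy the condition that the feedback exceeds the rate for every rate above the starting point. The starting rate, cap, and step count are also illustrative.

```python
def iterate_rate(nu0, feedback, b_max, steps):
    """Apply nu_{i+1} = min(feedback(nu_i), b_max) for a number of steps."""
    rates = [nu0]
    for _ in range(steps):
        rates.append(min(feedback(rates[-1]), b_max))
    return rates


def overestimating_feedback(nu):
    """Hypothetical feedback that always exceeds the current rate by 10%."""
    return 1.1 * nu


rates = iterate_rate(190.0, overestimating_feedback, b_max=500.0, steps=20)
print(rates[:4], rates[-1])
print(sum(rates) / len(rates))  # long-run average stays above the start
```

With feedback of this kind the rate never moves back toward the fair share; it climbs until it is clamped at the cap, and the running average stays above the initial rate.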

4 Numerical Examples 

We consider some numerical examples of cases where the flow control algorithm using
formula-based feedback results in unfair allocation of resources.

In the first example, the formula is assumed to accurately compute the TCP
throughput when the loss rate is correctly estimated, and the error in TCP throughput
estimation is due to erroneous estimation of the loss fraction. There are two TCP
sessions sharing the bottleneck link, which has a bandwidth $B$ of 500 KB/s. The fair
share is therefore 166.7 KB/s. We assume that the non-TCP flow has an initial rate
$\mu$ of 190 KB/s. An ns simulation was accordingly set up, with two TCP connections
sharing a 4 Mb/s link with a constant bit-rate source sending packets at 190 KB/s.
The observed loss fraction in the simulation is 0.0156. However, the loss fraction
as estimated by dividing the number of loss events by the total number of packets
transmitted is 0.0125. Using the same simulation set-up, it was determined that a
loss fraction of 0.0125 corresponds to an actual transmitting rate $\mu$ of 120 KB/s by
the constant bit-rate source. Therefore, any correct formula estimates the TCP rate
as $\frac{B - \mu}{n} = \frac{500 - 120}{2} = 190$ KB/s. In other words, the feedback parameter $B_{feed}$ is equal to
$\mu$ though $\mu$ is significantly larger than the fair share. Since $\mu = 190$, this leaves less
than 155 KB/s for either of the TCP connections. Thus, the bandwidth allocation
to $\mu$ is at least 23% greater than that for each TCP connection.
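
The arithmetic of this first example can be reproduced directly. The sketch below only restates the numbers given in the text (500 KB/s bottleneck, two TCP sessions, rates of 120 and 190 KB/s); the helper names are hypothetical.

```python
def fair_share(total_bw, n_tcp):
    """Fair share B / (n + 1) for n TCP sessions plus one non-TCP flow."""
    return total_bw / (n_tcp + 1)


def tcp_share(total_bw, mu, n_tcp):
    """Per-session TCP throughput (B - mu) / n when the non-TCP flow sends at mu."""
    return (total_bw - mu) / n_tcp


B, N_TCP = 500.0, 2                          # KB/s, number of TCP sessions
share = fair_share(B, N_TCP)                 # about 166.7 KB/s
feedback = tcp_share(B, 120.0, N_TCP)        # rate implied by the biased loss estimate
left_for_tcp = tcp_share(B, 190.0, N_TCP)    # what each TCP session gets at mu = 190
excess = 190.0 / left_for_tcp - 1.0          # relative advantage of the non-TCP flow

print(share, feedback, left_for_tcp, excess)
```

The feedback value coincides with the current rate of 190 KB/s, so the sender sees no reason to back off even though it holds roughly 23% more bandwidth than each TCP session.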

In the second example, we illustrate the effects of estimating the TCP throughput
based on the formula given in (2.2). There could be an error introduced by the
formula, compounded by an error due to inaccurate estimation of the loss fraction.
Among the measurements provided in [3], we consider the case where 100 serially
initiated TCP connections were established for 100-second intervals between two
hosts. An average throughput of 17.13 KB/s was observed per connection, with a loss
fraction of 0.0078, mean RTT of 0.2501 seconds and timeout period of 2.5127. The
throughput estimate, as given by (2.2), was 33.4 KB/s. Now, suppose the bottleneck
is a 496 KB/s (≈ 4 Mbps) link shared with 27 TCP sessions. The fair share is 17.75
KB/s, but when $\mu = 33.5$ KB/s, the TCP throughput is 17.13 KB/s. Assuming the
TO and RTT parameters are the same as in the measurement described above, the
approximate loss fraction is 0.0078, and the formula in (2.2) estimates the TCP
throughput as 33.4 KB/s. Assuming $l_{act} \approx l_{app}$, and applying the transformation
for $l_{est}$ given in (3.4), the estimate of the loss fraction on the non-TCP connection
is 0.0070. Substituting this value for $l$ in equation (2.2), $B_{feed} = B(l)$ is 34.01 KB/s,
which is higher than $\mu$ although $\mu$ itself is higher than the fair share. Thus, the
resource allocation to $\mu$ is at least about 89% higher than the fair share.

Abstract



We propose a new IP multicast congestion control protocol based on 
intelligent router packet filtering that allows data rate differentiation 
in presence of heterogeneous network conditions. A primary target 
application is reliable content distribution from a single server to a 
large population of independently acting receivers. The objective is to 
enable receivers with varying bandwidth availability to each receive a 
copy of the content at the maximum rate allowed by their fair usage of 
the network, independent of other receivers. For this application, our 
congestion control protocol is ideally combined with an FEC reliability 
protocol. 

1 Introduction 

We propose a new IP multicast congestion control protocol based on in- 
telligent router packet filtering. The objective is to enable receivers with 
varying bandwidth connections to the server to each receive packets at their 
maximum fair rate independent of other receivers. 

A primary target application is reliable content distribution from a single
server to a large population of independently acting receivers. For this
application, our congestion control protocol is ideally combined with a reliability
protocol that ensures that any receiver that collects any subset of
packets that are equal in aggregate length to the length of the content being
distributed can reliably recover the content. We hereafter call such a
reliability protocol an FEC protocol. As we describe in more detail in section
7, the combination of an FEC protocol and our new congestion control
protocol allows receivers to join an ongoing session at any point during the
session and reliably download content at the speed of their connection to
the server independent of other receivers. For other applications, such as
streaming multimedia content distribution, other reliability protocols are
ideally matched with our congestion control protocol. We discuss one such
application and reliability protocol briefly in section 8.

* Supported in part by NSF operating grant CCR-9S00452.

1.1 Congestion control protocol overview 

One of the main challenges in heterogeneous rate and congestion control is 
regulating the packet rate in specific points of the network, so that from 
that point a reduced rate flow is transmitted downstream, given that a 
certain (possibly higher) rate is received from upstream. The congestion
control protocol presented here controls the rate that packets flow to each 
receiver with no involvement by either the source or the receivers. The 
routers perform all of the flow and congestion control, where the mechanism 
used to slow down packet flow rates along certain paths is router based 
packet filtering. An FEC protocol should be used in conjunction with this 
congestion control protocol for reliable content delivery in order to guarantee 
that each receiver can reconstruct the content from a minimal amount of 
packet reception independent of the overall filtering rate from the source to 
that receiver. 

The congestion control protocol consists of four parts: 

1. Link congestion measurement 

2. Upstream congestion reports 

3. Interface rate computation 

4. Interface rate filtering 

Link congestion measurement monitors the packet drop rate along each 
link in the multicast tree. This measurement is not particular to the IP 
multicast flow across the link, but measures the aggregate of all flows across 
the link, which could include competing TCP/IP traffic as well as other 
IP multicast flow traffic from other sessions. Upstream congestion reports
are used to report the congestion measurements appropriately up the IP
multicast tree (from the receivers towards the multicast source) to the
point of regulation. Interface rate computation uses the congestion reports
to determine the rate at which packets should be allowed to flow along each
outgoing router interface for the IP multicast session. A possible desirable
property is that the computed IP multicast session rate achieves the fair
amount of the bandwidth across the interface, but no more (fair to other
flows across the interface, e.g., TCP/IP flows). The way congestion reports
are translated into interface rates is not explicitly defined here, as it depends
on the desired form of fairness. However, possible formula-based ways of
inferring these rates are mentioned. Finally, an interface rate filtering policy
is used to ensure that the IP multicast packets for the session flow out of
each outgoing interface at the computed rate for that interface. This policy
is essentially simple packet filtering.

2 Other congestion control solutions 

A common approach to multicast congestion control is sender-based rate
regulation [4]. This simplifies the task of regulation itself but precludes the
possibility of rate differentiation, posing the problem of choosing what the
single rate should be. The most common solution is to have the source
throttle its outgoing packet rate down to the lowest among the bandwidth
shares sustainable by a participating receiver or by a crossed link. This
conservative approach ensures that the multicast session does not use more
than its fair share on any of the paths or links crossed. Thus, participating
receivers that have higher available bandwidth paths to the server suffer at
the expense of receivers that have lower available bandwidth paths.

Solutions that support rate differentiation on multiple end-to-end paths
include layered multicast [8, 15, 3]. These solutions have basically the same
properties as the overall congestion control protocol proposed here, except
that they perform coarser-grained congestion control (in terms of possible
achievable rates and reaction speeds to changes in network conditions) and
they involve receivers in the congestion control instead of routers. This
poses some inter-receiver coordination challenges but has the advantage that
it does not need router modifications. Another disadvantage of the layered
solution is that it requires the use of multiple multicast groups to implement
a single session, which is not as efficient for source, receivers, and network
resource utilization.

Store-and-forward solutions also address the heterogeneity issue by means
of caching incoming packets at specific nodes and retransmitting them downstream
at lower rates. Compared to other approaches, these solutions utilize
network bandwidth the best (packets are transmitted only once on each
link), and the source does not have to keep transmitting packets once downstream
nodes have a complete copy; this property iterates to nodes further
downstream. This solution, however, presents drawbacks and problems that are
not completely solved. The major disadvantage is the presence of store-and-forward
agents that need to be managed and administered (including
space requirements and fast access to stored packets for retransmission).
Store-and-forward agents also introduce the unresolved issue of dynamically
placing/activating them at the points where their action is needed, unless
their ubiquitous presence is advocated, in which case the resources needed,
in terms of storage and protocol-processing power, could be too large to make
this a viable solution.

A direction of possible future research involves combining the solutions 
proposed here with caching solutions. Another possible direction for future 
research involves considering a solution where multiple sources are multicas- 
ting the same content, and receivers can hook into one or more of the streams 
emanating from the sources. The reason for considering this is the possi- 
bilities for additional reliability and increases in performance. Performance 
may improve due to better load balancing of the bandwidth throughout the 
network. 

3 Interface rate filtering 

This section describes how routers regulate an IP multicast flow on their
outgoing interfaces, assuming that they receive appropriate congestion
reports from downstream nodes, and that the congestion control algorithm
they run turns these congestion reports into a packet rate to use on the
outgoing interfaces.

Each router is aware of the packet flow rate $R$ from the source. This
information is delivered to the routers in the IP multicast tree through
SPMs [12] (session path messages). SPMs also have the function of starting
the protocol machinery along the multicast tree 1 . We assume that the
upstream congestion reports have already been used to compute an interface
filter rate $f_i$ for each outgoing interface $i$. The value of $f_i$ is the fraction of

1 SPMs can also be used to carry other information such as dynamically configurable 
session parameters, and can serve as means to discover protocol-capable upstream neigh- 
bors. 



the source packet flow rate that should be allowed to flow through interface
$i$, so that the throughput for the session through interface $i$ is $f_i R$.

Suppose that there is an interface $i^*$ upstream of interface $i$ that has a
filter rate $f_{i^*}$. A key goal is to ensure that if $f_{i^*}$ is less than $f_i$ then no
additional filtering takes place at interface $i$, and if $f_{i^*}$ is greater than $f_i$
then the packet flow is filtered down from rate $f_{i^*}$ to rate $f_i$ at interface $i$
by filtering out a fraction $f_{i^*} - f_i$ of the original source packet flow.

The filtering mechanism we recommend relies on a sequence number
included in each packet. The basic idea is to filter out packets based on
their sequence number as follows. Suppose that a packet is received for
possible transmission out interface $i$ with sequence number $a_0 a_1 a_2 \ldots a_{s-1} a_s$,
where $a_j$ is the $j$-th bit of the sequence number. Consider the binary number
$b = 0.a_s a_{s-1} \ldots a_2 a_1 a_0$, i.e., $b$ is the sequence number written in reverse order
expressed as a real number that is between 0 and 1. Then, the packet is only
allowed to go out of interface $i$ if $b < f_i$, and otherwise the packet is filtered
out. It turns out that this simple scheme satisfies the key goal described
above, and it also has the property that the flow rate out of the interface is
as smooth as is possible assuming packets are not lost for other reasons and
that sequence numbers are consecutive 2 .
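
The bit-reversal test described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the fixed sequence-number width `width` and the function names are assumptions, as is the reading that the least significant bit of the sequence number becomes the first fractional bit of b.

```python
def reversed_fraction(seq: int, width: int = 16) -> float:
    """Return b: the width-bit sequence number read in reverse bit order
    as a binary fraction in [0, 1). `width` is an assumed field size."""
    rev = 0
    for _ in range(width):
        rev = (rev << 1) | (seq & 1)  # lowest remaining bit becomes the next fractional bit
        seq >>= 1
    return rev / (1 << width)


def passes_filter(seq: int, f_i: float, width: int = 16) -> bool:
    """A packet is allowed out of interface i only if b < f_i."""
    return reversed_fraction(seq, width) < f_i
```

With consecutive sequence numbers, a filter rate of 0.5 passes every other packet and a filter rate of 0.25 passes every fourth. The packets that pass at 0.25 are a subset of those that pass at 0.5, which is why an interface whose filter rate is larger than that of an upstream interface performs no additional filtering.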

When measuring drop rates on interfaces, it is important not to count
packets that are intentionally filtered out, as these drops are not a
consequence of congestion but a means of performing flow regulation.

4 Upstream congestion reports 

Assuming that estimates of the drop rate per outgoing interface are available 
at every router, this section describes how the appropriate congestion signal 
is reported from the routers near the receivers upstream towards the source 
to the nodes which perform flow regulation. 

Two options seem viable to report congestion upstream. The most 
straightforward one is using some direct metric of congestion, such as the 
packet drop rate on outgoing interfaces. The other option is computing the 
data rate that a router is willing to receive —based on the rates which the 
router is allowed to transmit on its outgoing interfaces— and reporting this 

2 This turns out to be a consequence of some deep work in discrepancy theory by 
a mathematician by the name of Schmidt and others that is beyond the scope of this 
document. Schmidt won a Fields medal in part for proving that there is no smoother 
scheme, and this is the hard part. Computing the smoothness of this scheme is not that 
hard. It turns out that it is a lot smoother than would be obtained by doing random 
filtering. 
upstream. In this case the rate computation is performed where congestion 
occurs. 

Although the first solution seems the most obvious, the second is
preferred as it allows more flexibility. With it, for example, it is
possible to bias the aggressiveness of the flows in sharing the bandwidth
according to topological considerations 3 .

The basic rule to collect/forward congestion reports is the following. For
each outgoing interface $i$ from router $r$, $r$ computes a filter rate $g_i$ based on
the drop rate (congestion) on this interface (recall that a filter rate is
the fraction of the packets transmitted by the source that is allowed
through, chosen to achieve the throughput established by the congestion
control policy). For each outgoing interface $i$ from router $r$, $r$ has received
a filter rate $h_i$ from the downstream router along this interface for the IP
multicast session (initially, $h_i$ is set to 1). Then, the target filter rate
$f_i$ for the session along outgoing interface $i$ is set to the minimum of $g_i$
and $h_i$. When router $r$ passes a filter rate to the upstream router for the
session, it sends the maximum of the $f_i$'s over all interfaces $i$ that carry
the session. Router $r$ also uses the value of $f_i$ as a target to perform
packet filtering for the session along each interface $i$.
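
A minimal sketch of this rule for one session at one router follows. The dictionaries `g` and `h` and the function name are hypothetical; the sketch assumes that an interface with no downstream report yet uses the initial value 1.

```python
def update_filter_rates(g, h):
    """Apply the min/max rule for one session at one router.

    g: local filter rate per outgoing interface, derived from its drop rate.
    h: filter rate last reported by the downstream router per interface
       (interfaces with no report yet default to 1.0).
    Returns (f, report): the per-interface target rates and the value
    to send to the upstream router.
    """
    f = {i: min(g[i], h.get(i, 1.0)) for i in g}
    report = max(f.values())
    return f, report
```

For example, with `g = {"a": 0.8, "b": 0.3}` and `h = {"a": 0.5}`, interface a is limited by its downstream report, interface b by its local congestion, and the router asks its parent for the larger of the two targets.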

The rationale behind the rule is that the rate of packet flow down an
interface $i$ should not exceed the value $g_i$ that is computed based on its fair
share. On the other hand, if this is not the limiting rate, then the packet
flow rate should also not exceed the value $h_i$ that the downstream router
has indicated that this interface can handle. Finally, the report of the filter
rate to the upstream router from router $r$ should be set to the maximum of
all the rates that all the interfaces out of $r$ can handle.

Note that, once a new value for $f_i$ is computed on interface $i$, this might
either be used immediately as the filter rate to apply to that interface, or
it might determine the actual filter rate in some less direct way. As an
example, the congestion control scheme used might require that a rate
increase on an interface is not applied immediately but gradually through a
smoothing low-pass function.

The amount of memory needed to implement this scheme is $k + 1$ words
at outgoing interface $i$ if there are $k$ multicast sessions that are currently
sending packets through interface $i$: one word for the $h_i$ of each session
and one for $g_i$ (global to all the sessions).

3 A possible application for this is biasing the aggressiveness according to the distance
from the root of the tree, so that flows which aggregate more potential receivers (on links
closer to the root) can be given higher priority. Priority can also be based on the number of
downstream receivers, if this information is available.

4.1 Report periodicity 

Each node (router) generates periodic reports upstream to refresh the state
in its parent router and possibly in some of its direct ancestors. The
inter-report period determines the protocol responsiveness to changes in
network load and the accuracy of the control system. It also affects,
however, the protocol overhead in terms of control traffic. Possible options
to determine this period are using a protocol constant, tuning it through
SPMs, or making it a function of the speed of variation of the value being
reported. The last, adaptive solution is preferred as it allows dynamic
tuning of the periodicity, increasing the control traffic only if needed.
The following heuristic is suggested to implement the adaptive scheme
without incurring error accumulation: a report is sent when the previously
sent report differs by more than $\delta$ from the current value (determined
as the maximum of the $f_i$'s). However, if no reports have been sent for
$t_{max}$ seconds, a report is sent in any case. The purpose is to refresh an
upstream state that may be out of date due to lost reports or to a number of
changes in the filter rate that were each too small to report.

The value of $\delta$ should be a small percentage of the actual value. It
should also be chosen taking into account the method used to measure the
interface congestion statistics (see section 5): it should be larger than
the expected measurement error, so as to prevent high report traffic due to
measurement noise only. The value of $t_{max}$ should be short enough to
provide the degree of error resilience wanted, its lower limit being
determined by how frequently the congestion statistics are updated.
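
The heuristic can be sketched as follows. The class name, the relative threshold `rel_delta` (one way of making the threshold "a small percentage of the actual value") and the caller-supplied clock are assumptions made for illustration. Comparing against the last value actually sent, rather than the last value observed, is what avoids error accumulation.

```python
class ReportScheduler:
    """Decide when a router should send a filter-rate report upstream."""

    def __init__(self, rel_delta: float, t_max: float):
        self.rel_delta = rel_delta  # threshold as a fraction of the last sent value
        self.t_max = t_max          # maximum silence between reports, in seconds
        self.last_value = None
        self.last_time = None

    def should_report(self, value: float, now: float) -> bool:
        """Return True (and record the report) if `value` must be sent at time `now`."""
        first = self.last_value is None
        changed = (not first and
                   abs(value - self.last_value) > self.rel_delta * abs(self.last_value))
        stale = not first and now - self.last_time >= self.t_max
        if first or changed or stale:
            self.last_value, self.last_time = value, now
            return True
        return False
```

In this sketch a slow drift made of many sub-threshold changes still triggers a report once the accumulated difference from the last sent value exceeds the threshold, and the timeout covers lost reports.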

To assess the overhead generated by control traffic (reports), consider
both the case of a completely static situation (load values in links do not
change) and the case where the load distribution across the network changes
much more slowly than the frequency of reports. In the static situation,
where congestion conditions do not vary, each node generates periodic
reports every $t_{max}$ seconds, which cross one link only, as the updates
received in each node do not change the local status. In the dynamic
situation, consider the most general case of a router that is the root of
$n$ sub-trees. It will receive reports from all of them with a certain
periodicity. Not all the reports, however, will affect the maximum $f_i$. If
we assume that the congestion distribution varies more slowly than the
report periodicity, only one among the subtree reports and the local $g_i$'s
will affect the maximum $f_i$, hence the router will generate reports with the
same periodicity as that of the dominant subtree or the one imposed by one
of the local $g_i$'s. If we apply this recursively, we can see that the
periodicity is determined by one of the $g_i$'s of some router, thus the upper
bound for the overall report traffic is given by the speed of load changes
in that link. This is in turn bounded by the measurement process (see
section 5).

5 Interface congestion measurement 

Each router collects traffic statistics on outgoing interfaces and uses them
to infer traffic load on links. A number of different observations could be
made on link packet queues to estimate link utilization, including queue
size and number of packets dropped. Using queue size (or other direct
measurements of the link load) allows congestion control to be performed
with zero, or very small, steady loss rate. By contrast, using loss rate
implies that the stable operating point is reached only in the presence of a
certain loss rate in links, as there would not be any feedback otherwise.
In other words, congestion control is performed by slightly overloading the
network and using the observed loss rate to decide whether to increase or
decrease the offered load.

Although TCP is indirectly based on both, the loss rate is one of the
main factors that determine the long-term data rate. This means that, in
the presence of competing TCP flows, any degree of fairness can only be
reached with a significant loss rate in the steady operating condition. For
this reason we base our congestion statistics on the percentage of packets
dropped on the router's outgoing link queue.

Records of the number of packets forwarded and dropped already exist
in routers, hence the only addition required to estimate the percentage of
packets dropped is a function that computes a moving average of the ratio of
packets dropped versus packets queued to be forwarded. A simple way to do
this is using an exponentially weighted average, where the weighting factor is
chosen considering responsiveness requirements and measurement error issues.
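
A minimal sketch of such an exponentially weighted average is given below. The class name, the weighting factor `alpha` and the per-interval counters are assumptions; in particular, `offered` stands for the packets presented to the outgoing queue during one measurement interval.

```python
class DropRateEstimator:
    """Exponentially weighted moving average of the per-interval drop ratio."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha  # larger alpha: more responsive, but noisier
        self.rate = 0.0

    def update(self, dropped: int, offered: int) -> float:
        """Fold in one interval's counters; intervals with no traffic are skipped."""
        if offered > 0:
            sample = dropped / offered
            self.rate = (1 - self.alpha) * self.rate + self.alpha * sample
        return self.rate
```

Packets that are intentionally filtered out should not be included in `dropped`, since they are not a consequence of congestion.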

6 Interface rate computation 

The algorithm used to turn the congestion statistic into a target data rate
and to make the current rate converge to the target rate determines the
form of fairness achievable and the stability of the control system. Although
covering these issues is beyond the scope of this document, we will discuss
briefly the choice of the function that determines the long-term data rate
from the congestion measurements.

The most natural usage of this architecture is implementing Max/Min
fairness, where bandwidth is shared on a per-link basis. A simple way of
implementing this kind of fairness is translating congestion statistics into
link rates through a monotonically decreasing function. If all routers use
the same function for all the flows, then Max/Min fairness is automatically
achieved regardless of the actual shape of the function. The shape of the
function, however, determines the stability of the system, its convergence
speed and the steady loss rate in links as a function of the number
of flows crossing them.

If the function chosen is one of the flavors of the TCP equivalence
formula [7, 9], then an approximation of TCP-like fairness can be
implemented. TCP-like fairness is generally extended to multicast through
the following definition: "a multicast session is TCP-fair if its rate on any
of the end-to-end source-receiver paths is lower than or equal to the rate
that a TCP connection would achieve on that path". The long-term rate of a
TCP connection depends on the cumulative loss rate of the path traversed and
on its round-trip time (RTT). As we do not have any provision for estimating
RTTs and for cumulating path loss rates (although these could be easily
introduced in our architecture), we can only achieve some kind of
approximation of TCP-fairness. The simplest approximation would be using
the single link loss rate and an average TCP connection RTT, say
$\overline{RTT}$, and the TCP-equivalence formula to compute the rate. This
approximation would provide a form of fairness that can be stated like this:
"the multicast session achieves in each link a rate never (much) larger than
the rate a TCP connection crossing that link would achieve if that link was
the dominant bottleneck in the TCP path and if the path RTT was
$\overline{RTT}$".
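
As an illustration of this simplest approximation, the sketch below uses the widely quoted simplified TCP throughput relation, rate ≈ sqrt(3/2) · MSS / (RTT · sqrt(p)). The source does not say which flavor of the formula from [7, 9] is meant, so the choice of this variant, the packet size `mss`, and the function names are all assumptions.

```python
import math


def tcp_equivalent_rate(loss_rate: float, rtt_avg: float, mss: int = 1460) -> float:
    """Approximate long-term TCP rate in bytes per second for a given
    single-link loss rate and assumed average round-trip time (seconds)."""
    if loss_rate <= 0:
        return math.inf  # no loss observed: the formula imposes no limit
    return math.sqrt(1.5) * mss / (rtt_avg * math.sqrt(loss_rate))


def filter_rate(loss_rate: float, rtt_avg: float, source_rate: float,
                mss: int = 1460) -> float:
    """Fraction of the source rate (bytes per second) allowed through the interface."""
    return min(1.0, tcp_equivalent_rate(loss_rate, rtt_avg, mss) / source_rate)
```

The function is monotonically decreasing in the loss rate, as required above: quadrupling the loss rate halves the computed rate.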

The shape of the function used to compute the rate can also be deter- 
mined on a per-session basis and/or changed dynamically. A simple way of 
achieving this is using session path messages (SPMs) to carry the function 
parameters. This would allow easy tuning of several parameters to determine 
the congestion control behavior, such as aggressiveness and responsiveness. 

Another interesting possibility is using slightly different functions in
different nodes of the session tree. This would, for example, make it
possible to bias the flow aggressiveness in sharing bandwidth according to
the distance from the source, the number of downstream receivers or the node
fan-out. In this way multicast-specific sharing policies (e.g. [5]) could be
easily implemented.

7 Content Delivery 



A primary target application of the new congestion control protocol is a 
reliable content delivery system from a single server to a large population of 
independently acting receivers. For this application, our congestion control 
protocol should be used with an FEC protocol. An FEC protocol or an 
approximation thereof can be constructed using either standard or more 
advanced FEC codes, e.g., see [11, 6, 3]. Recall that an ideal FEC protocol 
ensures that any receiver that collects any subset of sent packets that are 
equal in aggregate length to the length of the content being distributed can 
reliably recover the content. Thus, for an FEC protocol, complete recovery 
of the content depends only on the number of packets received, and the 
number required to recover is independent of packet loss patterns and when 
the receiver joined and left the session. 

With our congestion control protocol, packets are intentionally filtered 
out to adjust the flow rate to each individual receiver down to their fair 
share of the bandwidth connection to the server. Because recovery of the 
content depends only on receiving the minimal possible number of packets 
independent of loss patterns when using an FEC protocol, this leads to the 
ideal behavior that each receiver recovers the content at their maximum fair 
rate independent of other receivers. This leads to a reliable content delivery 
system with the following desirable properties. 

o Each receiver is able to receive packets at the maximum fair rate at 
each point in time independent of the other receivers, e.g. a receiver 
that has a more constrained bandwidth connection to the source should 
not slow down transmission to receivers with less constrained band- 
width connections. 

o Each receiver is able to join and leave the session at its discretion 
without adversely affecting the source or the other receivers. 

o The source is able to transmit packets at a single fixed rate to one 
multicast session. This is the maximum rate available to receivers. 

o Receivers are able to receive packets without any feedback required 
for congestion or rate control. 

o Each receiver is able to recover the content as soon as it has received 
a number of packets that have aggregate length equal to that of the 
content, and this is independent of the other receivers. 



8 Layered Video 



Layered video is an application that is well-suited for our new congestion 
control protocol. The basic idea behind layered video is that there are 
several streams of video. Reception of the base layer stream provides a 
reasonable quality video playback experience, and as more and more en- 
hancement layers are received the quality incrementally improves. These 
layers are ordered, in the sense that the first enhancement layer is of no 
use unless the base layer is received, the second enhancement layer is of no 
use unless both the base and the first enhancement layer are received, etc. 
Multicast schemes have been proposed that allow different receivers with 
heterogeneous connections to the server to receive a quality of video stream 
that is proportional to their connection (see, e.g., [13] or [14]). 

We propose a reliability protocol that could be used in conjunction with
the congestion control protocol already described for this application.
Suppose that the full video stream is being transmitted at some rate $R$.
Suppose that layer $i$ of the stream comprises a fraction $p_i$ of the raw
stream. Let $c_0 = 0$ and in general let $c_i = \sum_{j \le i} p_j$. Then, one
naive way to send the stream in conjunction with the congestion control
protocol is to let the $i$-th stream be placed into packets with sequence
number $j$ such that when $j$ is written in reverse binary as a number between
0 and 1 its value is between $c_{i-1}$ and $c_i$. This ensures that when
packets are intentionally filtered out, the enhancement layers are filtered
out first in the appropriate order and the base layer last. Thus, if there
were no losses due to other causes then this would be an ideal way to place
the stream into packets.
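
The naive placement can be sketched as follows. The function names, the fixed sequence-number width and the use of half-open intervals $[c_{i-1}, c_i)$ are assumptions; layers are numbered from 1, with layer 1 as the base layer.

```python
from itertools import accumulate


def layer_boundaries(p):
    """Return [c_0, c_1, ...] with c_0 = 0 and c_i = p_1 + ... + p_i."""
    return [0.0] + list(accumulate(p))


def layer_for_sequence(seq: int, p, width: int = 16) -> int:
    """Layer (1 = base) carried by the packet with sequence number `seq`."""
    bits = format(seq & ((1 << width) - 1), f"0{width}b")
    b = int(bits[::-1], 2) / (1 << width)  # reverse-binary value in [0, 1)
    c = layer_boundaries(p)
    for i in range(1, len(c)):
        if b < c[i]:
            return i
    return len(p)  # guard against rounding when the fractions sum to just under 1
```

Because a packet passes a filter only if its reverse-binary value is below the filter rate, lowering the rate removes the highest layers first and the base layer last.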

Unfortunately, there will be losses of packets due to other causes, and in 
particular it is packet loss that triggers the congestion control protocol in 
the first place to filter out packets purposely. Thus, a much better idea is 
to partition each layer into blocks and then use FEC codes on these blocks 
to add an appropriate amount of redundancy to protect it against packet 
loss due to congestion (but not due to packet filtering). With this in mind, 
it is clear that the base layer should be protected more than the first en- 
hancement layer, and the first enhancement layer should be protected more 
than the second, etc. This is because the congestion control protocol filters 
packets according to the loss statistics, and for example the filtering down 
to the base layer will be triggered by heavier packet loss due to congestion 
control than filtering out just the top level enhancement layer. Thus, when 
the congestion control protocol filters out all the layers except for the base 
layer, enough protection should be added to the base layer to protect it 
against the packet loss rate due to congestion control that triggered that 
level of filtering. 

As an example, suppose that there are three layers of video, each
comprising a 1/3 fraction of the raw stream rate $R$. Suppose further that a
1% [packet loss rate triggers packet filtering] down to a rate of 2/3 of $R$,
and that a 10% [packet loss rate triggers] a packet filtering down to a rate
of 1/3 of $R$. Then, the base layer should be protected with at least 10%
[redundancy, and the first enhancement] layer should be protected with at
least 1% [...] done for example in order to have a good chance to recover the
base layer from the received packets when there is a 10% [loss due to]
congestion (and thus the filtering discards all but the packets in the base
layer).

There are a lot of details yet to be worked out on this subject. For example,
it would make sense to have only a fixed set of packet filtering rates allowed
according to the layers of the raw stream, and to come up with a reasonable
way to set the block lengths for each layer and the redundancy added to
each layer according to the average and maximum packet loss statistics due
to congestion that would cause the filtering down to this layer.

Abstract: An important concern for source-based multicast congestion
control algorithms is the loss path multiplicity (LPM) problem that arises
because a transmitted packet can be lost on one or more of the many end-to-
end paths in a multicast tree. Consequently, if a multicast source's
transmission rate is regulated according to loss indications from receivers,
the rate may be completely throttled as the number of loss paths increases.
In this paper, we analyze a family of additive increase multiplicative
decrease congestion control algorithms and show that, unless careful
attention is paid to the LPM problem, the average session bandwidth of a
multicast session may be reduced drastically as the size of the multicast
group increases. This makes it impossible to share bandwidth in a max-min
fair manner among unicast and multicast sessions. We show that max-min
fairness can be achieved, however, if every multicast session regulates its
rate according to the most congested end-to-end path in its multicast tree.
We present an idealized protocol for tracking the most congested path under
changing network conditions, and use simulations to illustrate that tracking
the most congested path is indeed a promising approach.

Keywords: multicast, congestion control, multiple loss paths, max-min
fairness.



I. Introduction 

THE deployment of multicast services in wide-area networks is expected
to lead to a proliferation of point-to-multipoint applications in the near
future. Such applications will constitute a significant portion of the
overall network traffic and will compete with existing point-to-point
applications for network bandwidth. Hence it is necessary to control and
regulate their bandwidth consumption in order to prevent network
congestion.

One possible approach towards multicast congestion control
is source-based rate control, in which a multicast source regulates
its transmission rate in response to loss indications (e.g.,
NAKs) from receivers. A number of specific source-based rate
control schemes have been proposed [1-3]; these represent important
first solutions in a very large solution space. However,
a number of fundamental issues remain open and have to be
addressed by any source-based approach towards multicast congestion
control. In this paper, we identify and examine two such
issues.

First, the loss indications received by a multicast source from
multiple receivers reflect diverse congestion conditions in various
parts of the network, and have to be appropriately combined
when making a single rate control decision. A transmitted
packet may be lost on one or more of the many end-to-end

This work was supported by the National Science Foundation under grant
NCR 95GiW and by TaSC under subcontract J0B899-S971 14. Any opinions,
findings, and conclusions or recommendations expressed in this material are
those of the author(s) and do not necessarily reflect the views of the National
Science Foundation.



paths in a multicast distribution tree. The number of such loss
paths is likely to increase with an increase in the number of receivers;
hence the probability that the source receives at least
one loss indication for every transmitted packet becomes high.
If the source reduces its rate in response to every loss indication
that it receives, its transmission rate will be severely throttled.

The second important issue concerns fairness in bandwidth
sharing among unicast and multicast sessions. Multicast connections
should not be allowed to usurp a large share of bandwidth
since that may starve unicast connections. On the other
hand, a multicast session's rate should not be throttled to the
extent that its bandwidth share is drastically reduced, since that
will discourage the widespread deployment and use of multicast
technology.

The central issue of interest in this paper is how a multicast
source combines loss indications from multiple receivers in
its multicast distribution tree for rate regulation, and how such
combinations affect fairness in bandwidth sharing. We analyze
a family of additive increase multiplicative decrease congestion
control algorithms [4], and show that, unless careful attention is
paid to the existence of multiple loss paths in a multicast tree,
the average bandwidth share of a multicast session may be reduced
drastically as the size of the multicast group grows. Our
results also indicate that it is impossible to share bandwidth in a
max-min fair manner unless the problem of multiple loss paths
is addressed. Our definition of max-min fairness is based on allocating
bandwidth to a multicast session according to the most
congested path in its multicast tree. We show that it is possible
to ensure max-min fairness according to this definition only
if every multicast source regulates its rate according to the most
congested path in its tree, or equivalently, according to loss indications
from only the "lossiest" receiver in the multicast group.
We present simulation results that show that tracking the worst
receiver is indeed a promising approach.

The rest of the paper is organized as follows. Section 2
presents a discussion of the problem arising out of the existence
of multiple loss paths in a multicast tree, and its effect
on fair bandwidth sharing. In Section 3, we describe a family
of additive increase multiplicative decrease congestion algorithms
and express the average session bandwidth as a function
of the observed loss probability at the source for these algorithms.
Section 4 describes a method for analytically deriving
the loss indication probability at a multicast source for some of
these algorithms. In doing so, we uncover some differences between
applications where the source retransmits lost data packets
and ones where it does not do so. In Section 5, we present
a number of case studies where we compute the average session
bandwidth for some of these algorithms for a number of network
scenarios. The results illustrate the severe degradation in
a multicast session's bandwidth share when the source responds
to loss indications from all receivers in the group. Section 6
presents simulation results that indicate that it is indeed possible
to eliminate the problem of multiple loss paths, and to achieve
max-min fair sharing of bandwidth by regulating a source's rate
according to loss indications from only the worst receiver in the
multicast group. Section 7 concludes the paper.

II. The Multicast Loss Path Multiplicity Problem 
and Fairness 

The most widely used unicast congestion control approach
in the Internet is end-to-end control of user traffic at the transport
level. In this approach, each traffic source regulates its rate
based on loss (and/or delay) feedback from its receiver. The
source maintains a rate (or window, as in the case of TCP [5])
parameter that is decreased multiplicatively every time a congestion
feedback (e.g., loss indication) is received, and increased
additively otherwise.

Let us now consider extending this approach to a multicast
source, with the source adjusting its rate in response to loss
indications (LIs) from receivers in its multicast group. This gives
rise to two problems. The first is the problem of spatial loss
correlation: a single packet loss may affect multiple receivers;
hence the source may receive more than one LI for the loss. If
the source reduces its rate in response to each such LI, it will
have overcompensated for the single loss. One possible way of
countering this problem is to have the source reduce its rate less
aggressively for each individual LI. Alternatively, assuming all
LIs for the same packet loss reach the source within a certain
time window, the source can react to only one of them and ignore
the rest [1].

The second problem arises due to the existence of multiple
end-to-end paths in a multicast tree. Suppose that a multicast
source reduces its rate in response to LIs from all its receivers,
but reacts to no more than one LI per transmitted packet. However,
a transmitted packet may be lost independently on one or
more of the multiple paths in the tree. As the number of such
paths increases, the probability that the source receives at least
one LI per transmitted packet also increases. We refer to this
problem as the loss path multiplicity (LPM) problem. In order
to gain an intuitive understanding of the problem and its effect,
let us consider a multicast group with $n$ receivers, each independently
experiencing a loss probability of $p$. Then the probability
that the source receives at least one LI per transmitted packet
is given by $Q = 1 - (1 - p)^n$. As $n \to \infty$, $Q \to 1$. Therefore
the multicast source regulates its rate as if it were observing
a single network path with loss probability $Q$, and the average
session bandwidth is very low.
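
To give a feel for how quickly Q grows, the following short sketch evaluates the expression above. The function name and the sample values of p and n are chosen only for illustration.

```python
def loss_indication_probability(p: float, n: int) -> float:
    """Probability Q that at least one of n receivers, each losing a packet
    independently with probability p, reports a loss for that packet."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, round(loss_indication_probability(0.01, n), 4))
```

With a per-receiver loss probability of 1%, a group of 100 receivers already yields a loss indication for well over half of the transmitted packets.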

Fig. 1. General representation of FLICA algorithms

If the LPM problem reduces the bandwidth share of multicast
sessions, competing unicast sessions will receive most of the
available network bandwidth, resulting in unfairness in bandwidth
sharing. In order to evaluate the extent of this unfairness,
we introduce the following fairness criterion. First, neither unicast
sessions nor multicast sessions are to be given preferential
treatment when allocating bandwidth. Second, bandwidth
is allocated to each multicast session according to the most congested
path in its tree. Such a policy conforms to the widely popular
notion of max-min fairness [6]. For example, suppose there
is an amount of bandwidth $B$ available on the most bandwidth-
constrained path in a multicast tree and there is one more unicast
session that traverses this path. Then under the fairness definition,
the multicast and the unicast session will each be allocated
a share $B/2$. Consequently the multicast session would also be
allocated bandwidth $B/2$ on every other path in its tree, even if
there is excess capacity on those paths. The excess bandwidth
on those paths is then available to other sessions that traverse
those paths. Thus our policy makes no assumptions about the
kind of preferential treatment to be given to any session. Also,
it allocates multicast bandwidth based on the notion that a multicast
session can only use as much bandwidth as is available on
the most bandwidth-constrained path.

In the rest of the paper, we examine how the LPM problem
may introduce unfairness according to this definition of max-
min fairness. In the course of our study, we also identify a
promising end-to-end approach for ensuring fairness. This approach
is based on having each multicast source identify the
most congested path in its distribution tree, by identifying the
"lossiest" or "worst" receiver, i.e., the one experiencing the highest
end-to-end loss probability. The source rate is then regulated
in response to LIs from only this receiver. Of course, algorithms
for increasing and decreasing the source rate have to be chosen
such that any two sessions experiencing the same end-to-end
loss probabilities will receive equal shares of bandwidth.

III. A Family of Rate Control Algorithms 

In this section we describe a family of additive increase multiplicative
decrease (AIMD) algorithms [4], collectively referred
to as FLICA (Filtered Loss Indication-based Congestion Avoidance),
for regulating a multicast source's transmission rate. Each
algorithm in the class decreases the transmission rate multiplicatively
in response to LIs from receivers, and increases it additively
in the absence of LIs. From the discussion in Section II,
we observe that not every LI from every receiver need be considered
for rate adjustment. The source decides which LIs to use for
this purpose and filters out the rest. Let us define a congestion
signal (CS) as an LI that the source uses for rate adjustment.
We can identify two main components of any FLICA algorithm
(Figure 1):

• a Loss Indication Filter (LIF): this determines which of
the LIs received are to be considered as CSs.

• a Rate Adjustment Algorithm: an algorithm that determines
how to decrease the rate when a CS is received and
how to increase the rate in the absence of CSs.
The design of the LIF is policy dependent. For example, in
the case of a representative-based scheme [2], where a source
responds to LIs only from a subset of receivers designated
as representatives, an LIF would filter out all LIs from non-representatives.
Alternatively, the LIF may be timer-driven, letting
through no more than one LI within a certain time interval. Such
a timer-driven LIF corresponds closely to the LTRC scheme in
[1]. The LIF need not necessarily be located at the source,
and may be centralized or distributed. For example, the representative
scheme in [2] proposes that non-representative receivers
suppress their LIs using backoff timers. For an active
network protocol such as the one proposed in [7], filters are actually
located at the active nodes inside the network and selectively
forward LIs towards the source. Note that Figure 1 applies
to unicast sources as well, with the LIF in that case letting
through all LIs from the single receiver.

For every FLICA algorithm, the source maintains a variable r
that represents the current transmission rate of the source. The
value of r is adjusted in response to CSs in the following manner:

On receiving a CS:  $r \leftarrow r - r/C$,

In the absence of any CS for time S:  $r \leftarrow r + 1$,

where C and S are adjustable parameters. Therefore the transmission
rate is reduced by 1/C of its current value on receiving
a congestion signal (multiplicative decrease). In the absence of
such signals, r is increased by 1 every S units of time (additive
increase). A particular FLICA algorithm is completely defined
by specifying its LIF and the values of C and S.
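
The two update rules can be illustrated with a minimal Python sketch. The class name `FlicaSource`, the field `last_event` and the tick-driven timer are assumptions made for this illustration; the text specifies only the two updates and the parameters C and S.

```python
from dataclasses import dataclass


@dataclass
class FlicaSource:
    """Rate state of one source (illustrative names, not from the paper)."""
    rate: float               # current transmission rate r
    C: float = 2.0            # a CS reduces the rate by r/C
    S: float = 0.2            # additive-increase period, in seconds
    last_event: float = 0.0   # time of the last CS or additive increase

    def on_congestion_signal(self, now: float) -> None:
        # Multiplicative decrease: r <- r - r/C
        self.rate -= self.rate / self.C
        self.last_event = now

    def on_tick(self, now: float) -> None:
        # Additive increase: r <- r + 1 for each full period S without a CS
        while now - self.last_event >= self.S:
            self.rate += 1.0
            self.last_event += self.S
```

A caller would invoke `on_congestion_signal` whenever the LIF lets an LI through, and `on_tick` periodically with the current time.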

Let us define the congestion signal probability $\lambda$ as the probability
that the source receives a CS for an arbitrary transmitted
packet. If B is the average session bandwidth obtained by
the source under a FLICA algorithm, then the functional dependence
of B on $\lambda$ is given by

$B(\lambda) = $ [right-hand side not legible in this copy]   (1)

The derivation of this result is provided in [8].

IV. Congestion Signal Probabilities for some LIF
Policies

In this section, we consider a few specific LIF policies and describe
how to compute the value of $\lambda$ for these policies and for a
given multicast topology. Computing the value of $\lambda$ is a prerequisite
for analytically computing the average session bandwidth
(equation (1)) attained by a session for a particular FLICA algorithm.
The LIF policies that we consider here are:

• Pass-All: Of all the LIs received for a transmitted packet
(new or retransmitted), only one is considered as a congestion
signal, and this LI may be from any receiver in the
multicast group.
• Pass-K-of-N: Given N receivers in a multicast group, K
($K \le N$) receivers are designated as representatives. All
LIs from the N - K non-representatives are ignored. If
one or more LI(s) are received from representatives for a
transmitted packet, only one is considered as a CS.
• Pass-Worst: The receiver with the highest end-to-end loss
probability is identified and all LIs from that receiver are
considered as CSs. LIs from all other receivers are ignored.

Note that Pass-All and Pass-Worst are special cases of Pass-K-of-N.
However, we introduce them separately for ease of exposition.

Before we proceed with the derivation of $\lambda$ for these LIFs, we
need to make a distinction between two models of data delivery.
The first is reliable delivery, where the source retransmits a
data packet as many times as required, until the packet has been
delivered at least once to every receiver in the multicast group.
In this case, the probability of generating an LI decreases with
repeated retransmissions of a packet. This must be taken into
account when deriving an expression for $\lambda$. The second model
of data delivery is no-retransmission delivery, where the source
does not perform any retransmissions. Loss indications (LIs)
are used in this case solely for rate adjustments. This model is
applicable to continuous media applications or to reliable data
transfer applications with repair servers providing repairs for
lost data packets. Unlike the reliable data delivery case, the
probability of generating at least one LI for a packet is now the
same for every packet transmitted, since no packet is transmitted
more than once by the source. We now present a method for
computing the value of $\lambda$ in each case.

A. Reliable Data Delivery 

Let us first consider the Pass-All LIF policy. Let T = (M, E)
be the multicast distribution tree spanning all receivers in a multicast
group, where M is the set of nodes, E is the set of directed
edges in the tree and all receivers are attached to leaf nodes of
the tree. Let $S \in M$ denote the root of the tree (i.e. the node
closest to the source) and let $c(n)$, $n \in M$, denote the set of
child nodes of node n in T.

Let $R^T(n)$ denote the number of times a packet has to be
transmitted to a node $n \in T$, until it has been received at
least once by all receivers downstream from n. Let $F_n^T(\cdot)$ be
the probability distribution function for $R^T(n)$ and $p_n$ be the
loss probability of a packet at node n. The expression for
$F_n^T(i) = P(R^T(n) \le i)$ is given in [9], with one case for
"n is a leaf" and one case for "otherwise"; the two expressions
are not legible in this copy.   (2)

Therefore,

$E[R^T(S)] = \sum_{i=0}^{\infty} \left(1 - F_S^T(i)\right)$   (3)



Since this is the expected number of times that a packet will
be transmitted, the expected number of times that at least one
receiver will lose the packet is $E[R^T(S)] - 1$. For each of these
times, the source will receive a CS. Hence the expected number
of CSs generated per $E[R^T(S)]$ packets is $E[R^T(S)] - 1$, and

$\lambda = 1 - \frac{1}{E[R^T(S)]}$   (4)
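
As a rough numerical illustration of the reliable-delivery computation, the Python sketch below evaluates the expected number of transmissions and the resulting congestion signal probability on a small tree. The closed form of equation (2) comes from [9]; the recursion in `cdf` is instead an assumption derived here from the definitions of $R^T(n)$ and $p_n$ under independent losses at each node, and may differ in form from the expression in [9]. The tuple-based tree encoding, the function names and the truncation point `max_i` are likewise illustrative.

```python
from math import comb

# A node is (loss_probability, [child nodes]); a leaf has an empty child list.


def cdf(node, i):
    """P(R(node) <= i), assuming independent losses at every node."""
    p, children = node
    if not children:
        return 1.0 - p ** i
    total = 0.0
    for j in range(i + 1):
        # Probability that exactly j of the i copies survive at this node
        survive = comb(i, j) * (1.0 - p) ** j * p ** (i - j)
        below = 1.0
        for child in children:
            below *= cdf(child, j)  # every subtree is served within j copies
        total += survive * below
    return total


def expected_transmissions(root, max_i=200):
    # E[R] = sum over i >= 0 of P(R > i), truncated at max_i terms
    return sum(1.0 - cdf(root, i) for i in range(max_i))


def pass_all_lambda(tree):
    # One CS for every transmission except the last one
    return 1.0 - 1.0 / expected_transmissions(tree)


def pass_k_of_n_lambda(tree, representative_tree):
    # CSs arise only until all representatives have the packet, while the
    # packet is transmitted until every receiver in the full tree has it
    return (expected_transmissions(representative_tree) - 1.0) / expected_transmissions(tree)
```

For a single receiver behind one lossy link, `pass_all_lambda` returns that link's loss probability, which is a quick sanity check on the recursion.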



Fig. 2. Modified Star Topology. q = loss probability on the shared link, $p_i$ = loss probability on $L_i$

Now consider the Pass-K-of-N LIF. Let Q = (M', E') be the
multicast distribution tree for only the representatives, where
$M' \subseteq M$ and $E' \subseteq E$. Note that for the Pass-Worst LIF, the tree
Q consists of the single end-to-end path from the source to the
lossiest receiver. Then the expected number of times a packet
has to be transmitted in order to be delivered at least once to
each of the representatives is



$E[R^Q(S)] = \sum_{i=0}^{\infty} \left(1 - F_S^Q(i)\right)$   (5)

where $F_S^Q(i)$ is defined by equation (2). Hence $\lambda$ is given by

$\lambda = \frac{E[R^Q(S)] - 1}{E[R^T(S)]}$   (6)



Note that in the case that K = N, Q = T, the Pass-K-of-N
LIF reduces to Pass-All and (6) reduces to (4). We have used
the above technique to compute $\lambda$ for two simple topologies -
a "modified star" (Figure 2) and a complete binary tree. The
derivations are described in [8].

B. No-retransmission Data Delivery 

For the Pass-All LIF, let us again start with an arbitrary multicast
tree T = (M, E) spanning all receivers in the multicast
group. Let $p_n$ be the probability of packet loss at node n,
$n \in M$. Let $Q_n^T$ be the probability that a packet transmitted to
node n is lost by at least one receiver downstream from n. $Q_n^T$
is computed recursively according to the following equation:

$Q_n^T = \begin{cases} p_n & n \text{ is a leaf node} \\ 1 - (1 - p_n) \prod_{m \in child(n)} \left(1 - Q_m^T\right) & \text{otherwise} \end{cases}$   (7)

Then,

$\lambda = Q_S^T$   (8)

where S is the root of T.
For a Pass-K-of-N LIF, $\lambda$ is given as

$\lambda = Q_S^Q$   (9)

As in the reliable data delivery case, [8] describes how to derive
$\lambda$ for the modified star and the complete binary tree topologies
using the above method.
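
The no-retransmission recursion is simple enough to evaluate directly. The following Python sketch is an illustration only: the tuple-based tree encoding and the helper names `loss_below` and `modified_star` are assumptions, and the modified star is modelled as one shared link with loss probability q feeding one arm per receiver.

```python
# A node is (loss_probability, [child nodes]); a leaf has an empty child list.


def loss_below(node):
    """Probability that a packet sent to `node` is lost by at least one receiver below it."""
    p, children = node
    if not children:
        return p
    all_receive = 1.0 - p  # the packet must survive at this node ...
    for child in children:
        all_receive *= 1.0 - loss_below(child)  # ... and in every subtree
    return 1.0 - all_receive


def modified_star(q, arm_losses):
    # Shared link with loss q, then one independent arm per receiver
    return (q, [(p, []) for p in arm_losses])


# Pass-All tracks the whole tree; Pass-Worst tracks a single end-to-end path.
pass_all = loss_below(modified_star(0.02546, [0.025] * 50))
pass_worst = loss_below(modified_star(0.02546, [0.025]))
```

With these inputs the Pass-All value comes out at roughly 0.73, while the single-path value stays close to the end-to-end loss probability of 0.05, which is the loss path multiplicity effect in miniature.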



V. Case Studies 

In this section, we study the behavior of some specific FLICA
algorithms for a modified star topology under several different
network loss scenarios. The metric used for evaluating the performance
of the algorithms is the average session bandwidth B.
B is computed using equation (1), with the value of $\lambda$ computed
according to the method described in Section IV. The purpose of
this study is to gain insights into the effect of the LPM problem
and into its possible solutions. We also study the performance of
the congestion control algorithm proposed for the PGM protocol
with the goal of understanding whether, and to what extent,
it is affected by the LPM problem.

The FLICA algorithms studied here use different loss indication
filters (LIFs) but the same rate adjustment algorithm with
C = 2.0 and S = 0.2 sec.

Let the modified star (Figure 2) have N = 50 receivers. Let
$Q_i$ be the end-to-end loss probability for receiver i. With $p_i$
and q as defined earlier, we then have $p_i = (Q_i - q)/(1 - q)$.
Let us define the independent loss ratio as $I_i = p_i/Q_i$. This
is a measure of the fraction of independent (i.e. not spatially
correlated with any other receiver) loss for receiver i.

Let us first consider identical loss probabilities for all receivers,
i.e. $Q_i = Q = 0.05$, i = 1, ..., 50. This implies
that the independent loss ratio is also the same for all receivers.
Let $I_i = I$, i = 1, ..., 50. Figure 3 illustrates the dependence
of $\lambda$ and B on I in the case of applications requiring reliable
data delivery. Figure 4 shows the same for applications using
no-retransmission data delivery.

We observe that for a Pass-All LIF, there is a sharp increase
in the value of $\lambda$ with increasing I. The effect is less significant
for reliable delivery since the probability of getting a NAK decreases
with repeated retransmissions of a packet. On the other
hand, with a Pass-Worst LIF (in this case, any one receiver can
be selected as the representative to track), there is no such sharp
increase in $\lambda$. For no-retransmission delivery, $\lambda$ is simply the
end-to-end loss probability for any receiver; hence it remains
invariant with I. Interestingly, in the reliable delivery case with
a Pass-Worst LIF, $\lambda$ decreases with increasing I. The reason
for this is as follows. Once the tracked receiver has received
a packet, it ignores all subsequent retransmissions of the same
packet. Hence if any such retransmission is lost by one of the
other receivers, the source does not receive a CS for it. As the
spatial loss correlation decreases, there is a greater chance that
the tracked receiver receives a packet which one or more of the
other receivers have lost. Since no CS is generated for any of
the subsequent retransmissions, $\lambda$ decreases.

For the Pass-All LIF, the increase in $\lambda$ with I leads to a drastic
reduction in the bandwidth (B) actually used by the multicast
session, since B is a decreasing function of $\lambda$. Significantly,
most of this reduction takes place between I = 0.0 and I = 0.1,
indicating that even small amounts of uncorrelated loss can have
harmful consequences for a multicast session's average bandwidth.
We also observe that, with a Pass-Worst LIF, there is no
such degradation in B, since $\lambda$ remains more or less unchanged.

From Figure 5, we observe that B scales poorly with the number
of receivers (N) for a Pass-All filter. The degradation is
quite drastic even when the independent loss ratio, I, is as small
as 0.1. This clearly shows the scalability problem introduced
by loss path multiplicity for a FLICA algorithm that responds to
LIs from all receivers.

We have considered a scenario where the loss probability on 
one arm of the modified star is higher than the others [8], and
have arrived at the same conclusions as in the uniform loss case 
described above. However, we do not describe the details here 
due to space constraints. 

From the results so far, we infer that the LPM problem arises
from tracking LIs from a large number of receivers when not all
the losses occur on the same end-to-end network path in a multicast
tree. So it may be possible to use a representative scheme,
where the source tracks only K of N receivers, to alleviate the
LPM problem to a certain extent. We now evaluate the performance
of some such schemes by considering FLICA algorithms
with a Pass-K-of-N LIF for a modified star (Figure 2). Let us
choose (without loss of generality) receivers 1, ..., K to be the
representatives. In addition to a FLICA algorithm with C = 2,
we consider FLICA algorithms with C = K + 1. As the value of
K increases, $\lambda$ is expected to increase due to the LPM problem.
For a fixed C, this has the effect of reducing B. However, if the
source reacts less aggressively to each CS by using a larger value
of C, then that should partially compensate for the reduction
in B as $\lambda$ increases. Note that when there is a single representative, the
value of C reduces to 2.

We consider three different loss scenarios. In the first case, all
receivers experience the same end-to-end loss probability. We
choose $Q_i = Q = 0.05$ and $I_i = I = 0.5$. Hence q = 0.02546
and $p_i$ = 0.025. We observe from Figure 6 that the degradation
in B with increasing K is less severe when C = K + 1 than
when C = 2, confirming our intuition. In both cases
however, as the number of representatives increases beyond a
certain value, there is a considerable decrease in B. This implies
that only a representative scheme with a very small number of
representatives (K < 5) can counter the effect of the LPM
problem.

The second loss scenario (Figure 7) differs from the first in
that $p_1 = 3 p_i$, i = 2, ..., N. The loss probability values chosen
are q = 0.02546, $p_1$ = 0.075 and $p_i$ = 0.025, i = 2, ..., N.
As expected, B decreases with K when C = 2. However, when
C = K + 1, B initially increases with increasing K up to about
$K \approx 5$, before starting to decrease. The reason is as follows.
For 2 < K < 5, the source does observe a higher $\lambda$ when K
increases. However this is more than compensated for by the
less aggressive reaction to every individual CS, which is a result
of the increase in the value of C. Hence the average session
bandwidth, B, actually increases.

The third loss scenario is one in which every representative
observes a significantly higher loss probability than every non-representative
[8]. Due to space limitations, we omit the details
here, but we arrive at similar conclusions in this case as in the 
two cases described above. 

From the three cases studied, we conclude that the effect of 
loss path multiplicity can be partially alleviated by using a rep- 
resentative scheme. However the average session bandwidth is 
sensitive to the choice of representatives and the choice of the 
rate adjustment algorithm. These choices are difficult since they 
must be tailored to the observed network loss conditions. At the
same time, we observe that these complications can be avoided
and max-min fair sharing of bandwidth can be achieved by having
each multicast source choose its worst receiver as the single
representative (K = 1).

The LPM problem is not restricted to the family of FLICA
algorithms alone. It affects any source-based rate control algorithm
that reduces the source's rate in response to LIs from
receivers without due consideration of the existence of multiple
loss paths in a multicast tree. In order to illustrate this, we have
performed case studies with the strawman congestion avoidance
algorithm proposed for the PGM protocol [7]. We do not present
the results here due to space constraints, but our results indicate
that the impact of the LPM problem in that case is just as severe
as in the case of the FLICA algorithms.

VI. A Simulation Study of Bandwidth Sharing 

The results in the last section indicate that the LPM problem 
can severely reduce the average session bandwidth of a multi- 
cast session. Representative schemes can partially alleviate the 
problem, but may not be able to eliminate it altogether. At the 
same time, tracking the worst receiver is a promising approach 
for ensuring max-min fair bandwidth sharing. In this section, 
we explore this approach more carefully through simulations. 

An event-driven simulator has been used to simulate two simple
networks - a two-armed star and a two-link tandem network -
that are shared by a number of unicast and multicast sessions.
Every session, unicast or multicast, has an infinite data source
and uses a FLICA algorithm with C = 8 and S = 500 msec. We
assume that data packets are never reordered, though they may
be lost due to buffer overflow at the gateways. Loss indications
(LIs) are in the form of negative acknowledgments (NAKs). The
reverse path used by these NAKs is different from the forward
path for data packets and NAKs are never lost or reordered. Lost
data is never retransmitted by the source, hence NAKs are used
only for the purpose of loss detection and rate adjustment at the
source. This corresponds to the no-retransmission data delivery
model described earlier. We also assume that the propagation
delay on the reverse path is variable, but that the distribution of
reverse path propagation delays is the same for all sessions. The
bandwidth share of a session is measured in terms of the average
transmission rate, r, which is defined as follows. If the source
transmits b packets in the interval $[t_1, t_2]$, then $r = b/(t_2 - t_1)$.

TABLE I
Transmission rates (packets/second) of unicast and multicast
sessions for simulations 1, 2 and 3.

A. Star Network 

The 2-armed star network consists of gateways G1, G2 and
G3, connecting links L1 and L2, as shown in Figure 8. Each of
links L1 and L2 has a bandwidth of 300 packets/second. Each
gateway uses a FIFO service discipline and has a buffer size
of 75. All sessions have their source connected to G1. Every
unicast session has its receiver connected to either G2 or G3,
whereas every multicast session has a receiver connected to each
of G2 and G3. Thus each multicast session consists of two non-overlapping
end-to-end paths over L1 and L2 respectively.

Simulation 1 involves five multicast sessions spanning L1 and
L2, five unicast sessions over L1 and five unicast sessions over
L2. The loss indication filter (LIF) at every multicast source is
designed such that all NAKs from the receiver attached to G2
pass through while no NAKs from the other receiver do. In effect,
every multicast session tracks NAKs from only one of two
equally congested paths. All the sessions are allowed to transmit
packets for 2200 seconds, with the session starting times being
staggered over the first second. Table I shows the mean, maximum
and the minimum value of the average transmission rates
for three groups of sessions: the five multicast sessions, the five
unicast sessions over L1 and the five unicast sessions over L2.
The measurement interval is taken to be [200 sec, 2000 sec]. We
observe that each session receives approximately the same share
of bandwidth on both L1 and L2, implying that it is possible to
achieve, or at least approach, max-min fair bandwidth sharing
in this case.

In simulation 2, five additional unicast sessions are started on
L1, thereby making it more congested than L2. Each multicast
session still uses the same LIF as in simulation 1, thereby
tracking the more congested of its two paths. We observe
(Table I) that on L1, all sessions (unicast or multicast) receive
approximately an equal share ($\approx$ 20 packets/sec) of the bottleneck
bandwidth. There is less traffic on L2, hence more available
bandwidth; however, each multicast session is constrained
to consume $\approx$ 20 packets/sec on all of its end-to-end paths. This
leaves about 200 packets/sec of bandwidth
available on L2, which is then shared equally among the
five unicast sessions traversing that link. Thus by using the same
control algorithm at every source and by determining a multicast
session's share by its most congested path, max-min fairness has
been realized.
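
The ideal shares quoted for simulation 2 follow from simple arithmetic, sketched below in Python. The function `star_shares` and its parameter names are hypothetical, and the calculation assumes that L1 is the more congested link, so that the multicast rate is set on L1 and carried over unchanged to L2.

```python
def star_shares(capacity, n_multicast, n_unicast_l1, n_unicast_l2):
    """Ideal max-min shares on the two-armed star when L1 is the bottleneck."""
    # L1 is split equally among every session that crosses it
    share_l1 = capacity / (n_multicast + n_unicast_l1)
    # Multicast sessions keep that rate on L2; L2 unicast sessions split the rest
    share_l2_unicast = (capacity - n_multicast * share_l1) / n_unicast_l2
    return share_l1, share_l2_unicast


# Simulation 2: 300 packets/sec links, 5 multicast sessions,
# 10 unicast sessions on L1 and 5 unicast sessions on L2
l1_share, l2_unicast_share = star_shares(300, 5, 10, 5)
```

Here `l1_share` is 20 packets/sec and `l2_unicast_share` is 40 packets/sec, matching the 200 packets/sec left over on L2 being divided among five unicast sessions.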

Simulation 3 differs from simulation 2 in that the LIF for each
multicast session lets through NAKs only from the receiver attached
to G3 and filters all NAKs from the one attached to G2.
Hence each multicast session now regulates its rate according to
the less congested of its two paths. We observe that L2's bandwidth
is shared equally among the five multicast sessions and the
five unicast sessions traversing it. But due to this, every multicast
session is able to attain a rate of about 30 packets/sec over
L1 as well. This leaves each unicast session on L1 with a share
of about 17 packets/sec and max-min fairness is not realized.
This observation emphasizes the importance of each multicast
session being able to correctly identify its most congested path.
However, the available bandwidth on different paths of a multicast
tree may be time variant; hence, a one-time identification of
the most congested path may not be sufficient. A multicast
source has to monitor all its end-to-end paths, determine which
one is currently the most congested and then choose a receiver
at the end of that path to track.

We next outline the design of an idealized protocol for doing
this. In this protocol, every receiver in a multicast group
monitors packet losses on its end-to-end path and maintains a
loss probability estimate p. On receiving packet i, this estimate
is updated using an exponential smoothing filter:

$p_{i+1} = \begin{cases} (1 - \alpha) p_i + \alpha & \text{if packet } i \text{ lost,} \\ (1 - \alpha) p_i & \text{if packet } i \text{ received,} \end{cases}$   (10)

where $\alpha$ is the gain factor of the filter. Every receiver periodically
reports the value of p to the source. The source uses a
Pass-Worst LIF that remembers the identity of the receiver currently
reporting the highest value of p, and allows only NAKs
from that receiver to pass through. A change in the congestion
condition on any end-to-end path is reflected in the value of p
reported by a receiver at the end of that path. Hence the source
is able to detect such changes and always regulate its rate according
to the worst end-to-end path. Note that the exponential
smoothing filter is one of many possible ways of estimating the
loss probability; a sliding or a jumping window may be used
instead. The tradeoffs in using these various approaches are an
open research problem.

TABLE II
Transmission rates (packets/second) of unicast and multicast
sessions for simulation 4.
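
The receiver-side estimator and the source-side Pass-Worst filter of the idealized protocol can be sketched in Python as follows. The class and method names, the zero initial estimate and the dictionary of reports are assumptions made for illustration; the text fixes only the smoothing update and the rule of passing NAKs from the receiver reporting the highest p.

```python
class LossEstimator:
    """Receiver side: exponentially smoothed loss probability estimate."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha  # gain factor of the filter
        self.p = 0.0        # assumed initial estimate

    def update(self, lost):
        # p <- (1 - alpha) * p + alpha if lost, else (1 - alpha) * p
        self.p = (1.0 - self.alpha) * self.p + (self.alpha if lost else 0.0)
        return self.p


class PassWorstFilter:
    """Source side: pass NAKs only from the receiver reporting the highest p."""

    def __init__(self):
        self.reports = {}  # receiver id -> latest reported estimate

    def on_report(self, receiver_id, p):
        self.reports[receiver_id] = p

    def worst(self):
        if not self.reports:
            return None
        return max(self.reports, key=self.reports.get)

    def is_congestion_signal(self, receiver_id):
        # A NAK counts as a CS only if it comes from the current worst receiver
        return receiver_id == self.worst()
```

Because the filter re-evaluates the worst receiver on every NAK, a change in the reported estimates switches the tracked receiver without any extra signalling in this sketch.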

Simulation 4 illustrates how such an LIF can ensure max-min
fair sharing of bandwidth, even under changing network conditions.
In this simulation, the setting is initially identical to
simulations 2 and 3, hence L1 is the more congested of the two
links. However, at t = 1200 seconds, a set of ten unicast sessions
is started on L2, making it more congested than L1. The
value of $\alpha$ has been chosen as 0.01. From the results in Table II it
is clear that max-min fair sharing of bandwidth is achieved both before and
after the onset of additional congestion on L2. This is made possible
because the LIF at each multicast source is able to always
identify the most congested path. So for the interval [200 sec, 
1200 sec], it identifies the receiver attached to t?l as the worst, 
but by t = 1400 sec, it has switched to the receiver attached to
G2. 

In addition to the star network, we performed simulations
with a two-link tandem network [8] to examine the effect of spatial
loss correlation on the behavior of the proposed Pass-Worst
LIF. Our results indicate that even with spatial loss correlation,
the bandwidth share of each multicast session is commensurate
with the loss probability on its most congested end-to-end path,
and max-min fairness is realized.

VII. Conclusions and Future Work 

In this paper, we have identified and studied the problem of 
loss path multiplicity that arises in the case of source-based mul- 
ticast congestion algorithms. Our study indicates that, unless 
due attention is paid to the existence of multiple loss paths in 
a multicast tree, a multicast session's share of bandwidth may 
be severely reduced. As a result, max-min fair sharing of band- 
width among multicast and unicast sessions cannot be realized. 
Representative schemes may alleviate the LPM problem partially,
but may not be able to eliminate it completely. We have
also identified an approach for ensuring max-min fairness, in
which every multicast source identifies the lossiest receiver (and
hence, the most congested path) in its multicast tree and regulates
its rate according to loss indications from that receiver. We
have described an idealized protocol for identifying and tracking
the worst receiver in the presence of changing congestion
levels in a network.

There are many issues that remain open for future research.
The design of a practical protocol for tracking the worst receiver
in a multicast group requires careful consideration of the issues
of estimating end-to-end loss probabilities and the timescales of
congestion. A detailed discussion of these issues can be found in
[8]. The issue of fairness in bandwidth sharing is a challenging
problem and criteria like TCP-friendliness ([2, 3, 10, 11]) and
inter-receiver fairness [12] may have to be considered when defining
fairness. The effect of feedback delay on source-based rate
control techniques remains to be studied. Finally, we intend
to explore how to extend our approach of tracking the worst
receiver to design congestion control protocols for active networks.

Abstract - As networked multimedia applications
become widespread, it becomes increasingly im- 
portant to ensure that these applications can co- 
exist with current TCP-based applications. The 
TCP protocol is designed to reduce its sending rate 
when congestion is detected. Networked multime- 
dia applications should exhibit similar behavior, if 
they wish to co-exist with TCP-based applications 
[9]. Using TCP for multimedia applications is not 
practical, since the protocol combines error con- 
trol and congestion control, an appropriate com- 
bination for non-real time reliable data transfer, 
but inappropriate for loss-tolerant real time appli- 
cations. In this paper we present a protocol that 
operates by measuring loss rates and round trip 
times and then uses them to set the transmission 
rate to that which TCP would achieve under similar 
conditions. The analysis in [13] is used to determine 
this "TCP-friendly" rate. This protocol represents 
a first step towards developing a comprehensive pro- 
tocol for congestion control for time-sensitive mul- 
timedia data streams. We evaluate the protocol 
under various traffic conditions, using simulations 
and implementation. The simulations are used to 
study the behavior of the protocol under controlled 
conditions. The implementation and experimenta- 
tion involve over 300 experiments over the Inter- 
net, using several machines in the US and UK. Our 
experimental and simulation results show that the 
protocol is fair to TCP and to other sessions run- 
ning TFRCP, and that the formula-based approach 
to achieving TCP-friendliness is indeed practical. 

I. Introduction 

Networked multimedia applications usually employ
non-TCP protocols (usually UDP with some application
level control) to transmit continuous media (CM)
data such as audio and video. As these applications
become widespread, it becomes increasingly important
to ensure that they are able to co-exist with each
other and with current TCP-based applications. A
key requirement of such a co-existence is the implementation
of some form of congestion control that results
in a reduction of transmission rate in the face of
network congestion. Many current CM applications
simply transmit data at the rate at which it was encoded,
regardless of the congestion state of the network.

This material is based upon work supported by the National
Science Foundation under Grant Nos. CDA-9502639 and NCR-9508274.
Part of the work was done when the first author
worked for Nokia Research.

Two major considerations come into play when de- 
signing a congestion control protocol for CM appli- 
cations. First, because these applications are both 
loss-tolerant and time-sensitive, the transmission rate 
might best be adapted in a manner that is cognizant 
of the loss resilience and timing constraints of the application
[14, 18]. Second, since the applications must
co-exist with TCP-based applications, the congestion 
control algorithms should adapt their rate in a way 
that "fairly" shares congested bandwidth with TCP 
applications. One definition of "fair" is that of TCP 
"friendliness" [9] - if a non-TCP connection shares a 
bottleneck link with TCP connections, traveling over 
the same network path, then the non-TCP connec- 
tion should receive the same share of bandwidth (i.e., 
achieve the same throughput) as a TCP connection. 

To develop a comprehensive CM congestion con- 
trol protocol, one can begin by designing a congestion 
control protocol that sets the transmission rate in a 
TCP-friendly manner. Once such a "strawman" or 
baseline protocol is designed, it can then be modified 
to support the timeliness requirements of CM data, 
perhaps with some loss of "friendliness." The design of 
the "strawman" TCP-friendly protocol must be flexi- 
ble enough to allow such modifications. This require- 
ment for flexibility rules out the use of TCP itself as 
the baseline protocol. The congestion control mechanisms
of TCP are tightly coupled with the mechanisms
that provide reliable delivery, an appropriate
combination for non-real time reliable data transfer, 
but inappropriate for loss-tolerant time-sensitive CM 
applications. In this paper, we propose a simple base- 
line TCP-friendly rate control protocol (TFRCP) that 
does not couple error-recovery and congestion control, 
and retains sufficient flexibility for later modifications. 

We present a congestion control algorithm that con- 
trols the sending rate in a manner that is roughly 
equivalent to that of TCP. Specifically, if a TCP 
connection achieves throughput X under given net- 
work conditions and measured over a given interval 
length, then the proposed protocol should also have a 
throughput of X over an interval of the same length 
and under the same network conditions. Note that 
the throughput X has to be measured over some
time interval, and based on the definition of "TCP- 
Friendliness" proposed in [9], we assume that this in- 
terval is significantly larger than the round trip time. 
The actual transmission rate, X, is determined by us- 
ing a model-based characterization of TCP through- 
put in terms of network conditions such as mean round 
trip time and loss rate. We base our protocol on the 
model proposed in [13]. In [23] the authors have pro- 
posed a similar approach for multicast congestion con- 
trol, using the formula proposed in [9]. Our protocol 
differs from theirs in that we use a more accurate
characterization of TCP and, unlike [23], we do not require
the use of data layering. Other TCP-friendly baseline
protocols that try to mimic the major features
of the TCP congestion control algorithm without providing
reliable delivery have been proposed [7, 17, 20, 24].
Some ongoing work, based partially on our ideas, with 
a focus on formula-based multicast congestion control, 
is also reported in [5, 6]. We discuss some of these
protocols and their limitations in the next section.

We believe there are several advantages to taking 
a formula-based approach towards developing a TCP- 
friendly congestion control scheme. First, a formula 
based approach is flexible. By changing the formula, 
one can easily adjust the performance of the proto- 
col. This feature can later be exploited for making 
the protocol sensitive to the timeliness requirement 
of the media being transported. In addition, if TCP 
and non-TCP flows are treated separately in the net- 
work (perhaps using a scheme such as [2]), then the 
formula-based approach can be modified to allow
non-TCP flows to compete only against one another.
Finally, in [23], it has been shown that such an approach
is more suitable for multicasting. Thus, a
formula-based approach based on an abstract TCP
characterization can be viewed as a first step towards developing
a comprehensive solution to the problem of congestion 
control for CM flows. 

We evaluate the protocol under various traffic con- 
ditions, using simulations and implementation. The 
simulations are used to study the behavior of the 
protocol under controlled conditions. The implemen- 
tation and experimentation involve over 300 exper- 
iments over the Internet, using several machines in 
the US and UK. Our experimental and simulation re- 
sults show that the protocol is fair to TCP and to 
other sessions running TFRCP, and that the formula- 
based approach to achieving TCP-friendliness is in- 
deed practical. 

The rest of this paper is organized as follows. In 
Section II, we present an overview of related work re- 
ported in the literature, followed by a description of 
our protocol and its advantages. In Section III, we
present simulation studies of our protocol. In Section 
IV, we present results from a "real-world" implemen- 
tation of the protocol. In Section V we discuss some of 
our design choices. Section VI concludes the paper. 

II. Rate Adjustment Protocols 

Several TCP-friendly rate adjustment protocols 
have recently been reported in the literature [7, 17, 
20,23,24]. Of these, [23,24] are specific to multicast 
applications, while [7, 17,20] are unicast oriented. We 
now briefly review each of these five schemes, describe 
the new TFRCP protocol, and show how it overcomes 
some of the limitations of earlier work. 

A. Previous Work

In [7], the authors describe a protocol that may be
classified as a "TCP-Exact" approach. They propose a
protocol which manages its window size in exactly the 
same way as TCP does, but instead of retransmitting 
lost packets, it allows the user to send new data in 
each packet. The principal concern with this protocol
is its inflexibility. Since the protocol strictly adheres 
to TCP window dynamics, it would be hard to modify 
it to take into account timeliness requirements of CM 
data delivery. 

The TCP-friendly protocols reported in [17,20,23, 
24] are based (either explicitly or implicitly) on the 
TCP characterization first reported in [9] and later 
formalized in [10, 12]. This characterization states 
that in the absence of timeouts, the steady state
throughput of a long-lived TCP connection is given by:

$$\text{Throughput} = \frac{C}{R \sqrt{p}} \qquad (1)$$

where C is a constant that is usually set to either
1.22 or 1.31, depending on whether or not the receiver
uses delayed acknowledgments, R is the round trip
time experienced by the connection, and p is the ex- 
pected number of window reduction events per packet 
sent. Note that the throughput is measured in terms 
of packets/unit time. Also note that p is not the 
packet loss rate, but is the frequency of loss indica- 
tions per packet sent [10]. The packet loss rate pro- 
vides an upper bound on the value of p, and may be 
used as an approximation. The key assumption be- 
hind the characterization in (1) is that timeouts do 
not occur at all. Consequently, it is reported in [10] 
that (1) is not accurate for loss rates higher than 5%. 
As the formula does not account for timeouts, it typi- 
cally overestimates the throughput of a connection as 
loss rate increases. Data presented in [10, 13] shows 
that timeouts account for a large percentage of win- 
dow reduction events in real TCP connections, and 
that they affect performance significantly. 
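
As a rough numerical illustration of (1), the following Python sketch evaluates the timeout-free throughput estimate. The function name and the sample values of R and p are assumptions chosen for illustration; only the symbols C, R, and p come from the text.

```python
import math

def simple_tcp_throughput(rtt, p, c=1.22):
    """Evaluate (1): throughput in packets per unit time, ignoring timeouts.

    rtt is the round trip time R, p is the frequency of loss indications
    per packet sent, and c is the constant C (1.22 or 1.31 in the text).
    """
    if rtt <= 0 or not 0 < p <= 1:
        raise ValueError("rtt must be positive and p must lie in (0, 1]")
    return c / (rtt * math.sqrt(p))

# Illustrative values (assumed): 100 ms round trip time and one loss
# indication per 100 packets sent.
rate = simple_tcp_throughput(rtt=0.1, p=0.01)
print(f"{rate:.1f} packets/second")
```

Because (1) has no timeout term, the estimate falls only as the inverse square root of p, which is consistent with the overestimation at higher loss rates noted above.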

In [23] the authors propose a multicast congestion 
control scheme in which the data is transmitted in a 
"layered" manner over different multicast groups. The 
more layers a receiver joins, the more data it receives. 
In [23] the receivers compute round trip times and es- 
timate the packet loss rate p, and use (1) to compute 
the "TCP-friendly" rate at which they should receive
the data. Based on this estimate, and the knowledge 
of the layering schemes, each receiver can dynamically 
decide to join or leave certain multicast groups to ad- 
just the rate at which it receives the data. In [24], the 
authors propose a similar scheme in which the layers 
have data rates that are fixed multiples of a base rate, 
and a TCP-like effect (additive increase, multiplica- 
tive decrease) is achieved by using strict time limits 
on when a receiver might join or leave a group. The 
analysis of the algorithm yields a throughput
characterization that is similar to (1). Apart from not being
TCP-friendly at loss rates above 5%, both schemes
rely on data layering, which is not easy to achieve for 
all types of CM encodings. In addition, determining 
round trip times in a multicast setting is a difficult 
task, as noted in [23]. 

In [20] the authors propose a scheme that is suitable 
mainly for unicast applications, but may be modified 
for multicast applications. The scheme relies on
regular RTP/RTCP reports [19] sent between the sender
and the receiver to estimate the loss rate and round
trip times. In addition, they propose modifications to 
RTP that allow the protocol to estimate the bottle- 
neck link bandwidth using the packet-pair technique 
proposed in [1]. An additive increase/multiplicative 
decrease scheme based on these three estimates (loss 
rate, round trip delay, and bottleneck bandwidth) is 
then used to control the sending rate. The scheme 
has several tunable parameters whose values must 
be set by the user. In addition, the scheme is not 
"provably" TCP-friendly, although TCP-friendliness 
is evidenced in the few simulations reported in the 
paper. In [17] the authors propose an additive in- 
crease/multiplicative decrease rate control protocol 
that uses ACKs (in a manner similar to TCP) to es- 
timate round trip times and detect lost packets. The 
rate adjustment is done every round trip time. The 
authors also propose to use the ratio of long-term and 
short-term averages of round trip times to further
fine-tune the sending rate on a per-packet basis.

Although the protocols reported in [20] and [17] do 
not explicitly use (1) to control their rates, the work 
in [9, 10, 12] has shown that the relationship between 
loss rate and the throughput of these protocols will be
similar to (1). As a result, these protocols will not be 
"TCP-friendly" at loss rates higher than 5%. While 
[20] ignores this problem, in [17] the authors mention 
that their work is targeted towards a future scenario 
in which SACK TCP [3] and RED [4] switches will be 
widely deployed, reducing the probability of timeouts. 
However, in the present Internet, TCP-Reno [21] is
the predominant protocol and very few RED switches 
have been deployed. 

In the next section we propose a new protocol that 
achieves TCP friendliness in a more "real world" sce- 
nario that includes competing TCP-Reno connections, 
drop-tail switches and diverse background traffic con- 
ditions. 

B. The TFRCP Protocol 

The TFRCP protocol is a rate-adjustment conges- 
tion control protocol that is based on the TCP char- 
acterization proposed in [13]. Unlike [9,10,12], the 
characterization in [13] takes into account the effects 
of timeouts, a consideration that is particularly im- 
portant when TCP-Reno (one of the most widely de- 
ployed versions of TCP) is used with drop-tail routers, 
which tend to produce correlated losses. If a TCP- 
Reno connection encounters correlated losses, it tends 
to experience a significant number of timeouts [3]. In 
[13] the authors quantify this phenomenon and its
effects on throughput. The resulting analytic
characterization of TCP throughput can be stated as follows:

$$\text{Throughput} = f(W_{\max}, R, p, B) \qquad (2)$$

where throughput is measured in packets per unit
time, $W_{\max}$ is the receiver's declared window size, R
is the round trip time experienced by the connection,
p is the loss rate (or, more accurately, the frequency
of loss indications per packet sent) and B is the base
timeout value [21]. A complete statement of the
formula is presented in the Appendix.
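
For orientation, the sketch below codes the widely quoted approximate form of the throughput model from [13]. It is an assumption that this matches the statement in the Appendix, which remains the authoritative version. The function name and the parameter b (packets acknowledged per ACK, set to 2 here) are illustrative and are not among the arguments listed in (2); the argument t0 plays the role of B.

```python
import math

def tcp_friendly_rate(w_max, rtt, p, t0, b=2):
    """Approximate TCP throughput in packets per unit time, with timeouts.

    w_max: receiver's declared window size (packets)
    rtt:   round trip time R
    p:     frequency of loss indications per packet sent
    t0:    base timeout value B
    b:     packets acknowledged per ACK (assumed value, see lead-in)
    """
    if p <= 0:
        # With no loss the rate is limited only by the receiver window.
        return w_max / rtt
    # Time per packet spent in congestion avoidance.
    congestion_term = rtt * math.sqrt(2 * b * p / 3)
    # Additional time per packet attributable to timeouts.
    timeout_term = (t0 * min(1.0, 3 * math.sqrt(3 * b * p / 8))
                    * p * (1 + 32 * p * p))
    return min(w_max / rtt, 1.0 / (congestion_term + timeout_term))

# Illustrative values (assumed): window of 20 packets, 100 ms RTT, 1 s timeout.
for loss in (0.01, 0.1, 0.5):
    print(loss, round(tcp_friendly_rate(20, 0.1, loss, 1.0), 2))
```

In this form the timeout term grows quickly with p, so the computed rate drops to a small value at high loss rates, unlike the timeout-free characterization in (1).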

There are two parts to the TFRCP protocol: a 
sender-side protocol and a receiver-side protocol. The 
sender-side protocol works in rounds of duration M 
time units. We call M the recomputation interval. At 
the beginning of each round, the sender computes a 
TCP-friendly rate (we will shortly describe this com- 
putation in detail), and sends packets at that rate. 
Each packet carries a sequence number and a times- 
tamp indicating the time the packet was sent. The re- 
ceiver acknowledges each packet, by sending an ACK 
that carries the sequence number and timestamp of 
the packet it is acknowledging. Consider an ACK for 
a packet whose sequence number is k. In addition to 
the sequence number and the timestamp, the ACK
also carries a bit vector of 8 bits indicating whether
or not each of the previous 8 packets (k - 7 ... k) was
received. The sender processes these ACKs to
compute the sending rate for the next round. Note that each
packet is ACKd eight times, providing some
protection against ACK losses.
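
The ACK format described above can be sketched in Python as follows. The dictionary layout, the function names, and the bit ordering (bit 0 for packet k, bit 7 for packet k - 7) are assumptions made for illustration; the text specifies only that the ACK carries a sequence number, a timestamp, and an 8-bit vector.

```python
def build_ack(k, send_timestamp, received):
    """Build an ACK for packet k.

    received is the set of sequence numbers the receiver has seen so far.
    """
    bits = 0
    for offset in range(8):  # offset 0 is packet k, offset 7 is packet k - 7
        if (k - offset) in received:
            bits |= 1 << offset
    return {"seq": k, "timestamp": send_timestamp, "bits": bits}

def decode_ack(ack):
    """Return {sequence number: received flag} for packets k - 7 ... k."""
    k = ack["seq"]
    return {
        k - offset: bool((ack["bits"] >> offset) & 1)
        for offset in range(8)
        if k - offset >= 0  # ignore positions before the first packet
    }

# Packets 5 and 8 are missing when packet 10 arrives.
ack = build_ack(10, 1.5, {3, 4, 6, 7, 9, 10})
status = decode_ack(ack)
print(sorted(seq for seq, ok in status.items() if not ok))
```

A cleared bit only means the packet had not arrived when this ACK was generated; since each packet is covered by up to eight ACKs, a later ACK may still report it as received.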

Let us now consider the sending rate computation 
in detail. Consider round i. Let $r_i$ be the sending rate
for this round, R be the current round trip time
estimate, and B be the estimate of the base timeout
value. The number of packets to be sent in this round
is $n_i = r_i \cdot M$. The $n_i$ packets are clocked out
uniformly during the round (footnote 1). As noted earlier, packets
carry a sequence number and a timestamp indicating 
the time the packet was sent. The sender keeps a 
log of all packets it has sent in this round. The log 
contains two entries for each packet. The first entry 
indicates whether the packet (i) has been received and
acknowledged by the receiver, (ii) is presumed
lost, or (iii) is of unknown status (neither ACKd nor yet
presumed lost). We call this the "received status" of
the packet. The second entry consists of a value that
is equal to the time the packet was sent plus the
current base timeout value. We call this the "timeout
limit" for the packet.

1 In simulation studies, it is possible to clock out packets evenly
over the entire duration of the round. This is not possible in
an actual implementation, due to the limited accuracy of timers. We
discuss this further in Section IV.

As the sender sends packets, it also receives ACKs 
from the receiver. Consider an ACK carrying se- 
quence number k that is received by the sender at 
time $t_k$. Let the timestamp carried by the ACK be
$s_k$. The sender updates the lost/received status of
packets (k - 7 ... k) using the bit vector available in
the ACK. The sender also updates the round trip time
estimate (R) and base timeout (B) using the
difference $t_k - s_k$. This update is done exactly as in TCP;
see [22] for the details of the computation. At the end
of the i-th round, the sender computes $r_{i+1}$ as follows:

Let the current time be $t_i$. Let j be the packet
with the smallest sequence number whose received
status was "unknown" at the end of round i - 1, l
be the last packet sent, and a be the highest sequence
number for which we have received an ACK. Then any
packet whose sequence number lies between j and l
(both included) and whose timeout limit is less than
$t_i$ is marked as lost. Also, any packet whose sequence
number lies between j and a (both included) and
whose received status is "unknown" is marked as lost.
Let $x_i$ be the number of packets marked as "received"
between j and a, and let $y_i$ be the number of packets
marked as "lost" between j and a. Then:
• If $y_i = 0$, then no packets were lost and:

$$r_{i+1} = 2 \cdot r_i$$

Hence, when no packets are lost in a round, packets
are sent twice as fast in the next round. We will
discuss this feature more in Section V.
• Otherwise, $y_i \neq 0$. Let $p_i = \frac{y_i}{x_i + y_i}$. In this case, the
rate for round i + 1 is

$$r_{i+1} = f(W_{\max}, R, p_i, B)$$

where f is defined in (2). It is here that the analytic
characterization in [13] comes into play.
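
The end-of-round computation can be sketched in Python as below. This is a minimal sketch under stated assumptions: the log layout and all names are hypothetical, the timeout rule is applied only to packets that have not already been acknowledged, the loss estimate is taken to be the lost fraction of the packets counted between j and a, and rate_formula stands in for f from (2) with the window size, R, and B already fixed.

```python
def next_rate(current_rate, log, j, a, l, now, rate_formula):
    """Compute the sending rate for the next round from the packet log.

    log maps sequence number -> {"status": "received" | "lost" | "unknown",
                                 "timeout_limit": float}.
    j: smallest sequence number with unknown status at the end of the
       previous round; a: highest sequence number ACKed; l: last packet sent.
    """
    # Packets in [j, l] whose timeout limit has passed are presumed lost
    # (assumption: only packets not yet acknowledged are affected).
    for seq in range(j, l + 1):
        entry = log[seq]
        if entry["status"] == "unknown" and entry["timeout_limit"] < now:
            entry["status"] = "lost"
    # Packets in [j, a] that are still unacknowledged are presumed lost.
    for seq in range(j, a + 1):
        if log[seq]["status"] == "unknown":
            log[seq]["status"] = "lost"
    received = sum(1 for s in range(j, a + 1) if log[s]["status"] == "received")
    lost = sum(1 for s in range(j, a + 1) if log[s]["status"] == "lost")
    if lost == 0:
        return 2 * current_rate          # no loss: double the rate
    return rate_formula(lost / (received + lost))

# Ten packets sent, packet 4 never acknowledged; toy formula for illustration.
log = {s: {"status": "received", "timeout_limit": 10.0} for s in range(10)}
log[4]["status"] = "unknown"
print(next_rate(40.0, log, j=0, a=9, l=9, now=0.0,
                rate_formula=lambda p: 1000 * p))
```

Passing the formula as a callable keeps the loss bookkeeping separate from the throughput model, which matches the flexibility argument made for the formula-based approach.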

The starting value, $r_0$, can be set to any reasonable
value. We have found that for sufficiently long flows,
and for reasonable values of M, the value of $r_0$ has
little impact on the performance of the protocol. For all
simulations and experiments described in this paper, 
we set this value to 40 packets/second. The initial 
values of R and B are set in a manner similar to TCP 
[22]. 

TFRCP has no built-in error recovery mechanisms. 
When a comprehensive congestion control protocol
based on TFRCP is developed, applications will
be able to choose an error control strategy that is
appropriate for the given media type. An important 
feature of any transmission control protocol is "self- 
limitation" [17]. This means that if the protocol starts 
experiencing 100% or near 100% losses, its sending 
rate should be reduced to almost zero. TCP achieves 
this via timeouts and eventual closedown of the
connection. The TFRCP protocol uses the model
proposed in [13], which takes into account the effect of
timeouts and automatically reduces the sending rate 
to very small values at high loss rates. 

The key question is how frequently the sender 
should re-compute the rate, i.e., how to determine the 
value of M. In the following section we use simula- 
tions to explore various strategies for choosing M, and 
their impact on the performance of the protocol. 

III. Simulation Results 

In this section we present simulation studies of the 
TFRCP protocol. The simulations are used to study 
the behavior of the protocol under controlled con- 
ditions. In the following section we present addi- 
tional studies carried out over the Internet. We have 
used the ns simulator [11] for our simulations. There 
are two main challenges for any simulation study of 
this nature: first, how to select appropriate network 
topologies and how to effectively model the back- 
ground traffic and second, how to define and measure 
appropriate performance metrics. Several difficulties 
in this regard are pointed out in [16]. Thus, before we 
present any simulation results, we discuss our
simulation topology and our performance metrics.

A. Simulation Topology 

In our simulations, we use a simple topology to un- 
cover and illuminate the important issues; our exper- 
iments with TFRCP over the Internet test its use in 
"real-world" scenarios. The simulated network
topology assumes a single shared bottleneck link, as shown
in Figure 1. The sources are arranged on one end of 
the link and the receivers on the other side. All links 
except the bottleneck link are sufficiently provisioned 
to ensure that any drops/delays that occur are only 
due to congestion at the bottleneck link. All links 
are drop-tail links. Many previous studies [3, 4, 17,
20] have used similar topologies.

The problem of accurately modeling background 
traffic is more difficult. We consider three types of 
background traffic: infinite-duration FTP-like
connections, medium-duration FTP-like connections and
self-similar UDP traffic. The infinite-duration FTP
connections allow us to study the steady-state
behavior of our protocol. Medium-duration FTP
connections introduce moderate fluctuations in the
background traffic. Finally, self-similar UDP traffic is
believed to be a good model for short TCP connections
such as those resulting from web traffic [15,25].

Fig. 1. Simulation Topology

When multiple TCP connections are simulated over 
a single bottleneck link, the connections can become 
synchronized. We take two measures to prevent such 
synchronization. First, we start the connections at 
slightly different times. Second, before each packet is 
sent out, a small random delay is added to simulate 
processing overhead. These measures are applied to 
both TCP and TFRCP connections. 

B. Performance Metrics

Recall that we view TFRCP protocol as only a first 
step towards developing a comprehensive congestion 
control protocol for CM data flows. Thus, we are 
only interested in measuring the "TCP-friendliness"
of the TFRCP protocol. We define the "friendliness" 
metric as follows. Let $k_c$ denote the total number
of monitored TFRCP connections and $k_t$ denote the
total number of monitored TCP connections. We
denote the throughput of the $k_c$ TFRCP connections
by $T_1^c, T_2^c, \ldots, T_{k_c}^c$ and that of the TCP connections by
$T_1^t, T_2^t, \ldots, T_{k_t}^t$, respectively. Define:

$$T_C = \frac{\sum_{i=1}^{k_c} T_i^c}{k_c} \quad \text{and} \quad T_T = \frac{\sum_{i=1}^{k_t} T_i^t}{k_t}$$

The performance metric of interest is the "friendliness
ratio", F:

$$F = T_C / T_T$$

Another metric for measuring performance is the
"equivalence ratio", E:

$$E = \max(T_T / T_C, \; T_C / T_T)$$



Note that the value of E is always at least 1. E gives a
better visual representation of the closeness of the 
throughputs achieved by the two protocols. How- 
ever, this metric will distort any trend that might be 
present in the ratio of the two throughputs as we vary 
various parameters. For example, a decreasing value 
of F as a function of some system parameter will not 
always result in a decreasing value of E. Thus, we use 
F as the fairness metric whenever we are interested 
in trends, and use E otherwise. It is also important 
that the TFRCP connections achieve fairness amongst 
themselves. We define the ratio: 

$$\min_{1 \le i \le k_c} T_i^c$$

to characterize the fairness achieved among the 
TFRCP connections. 
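
The friendliness and equivalence ratios can be computed as in the following Python sketch. The function names and the sample throughput values are assumptions for illustration. The loop at the end shows why E can hide a trend: as F falls steadily from 1.5 to 0.5, E first falls and then rises.

```python
def friendliness_ratio(tfrcp_throughputs, tcp_throughputs):
    """F: mean TFRCP throughput divided by mean TCP throughput."""
    t_c = sum(tfrcp_throughputs) / len(tfrcp_throughputs)
    t_t = sum(tcp_throughputs) / len(tcp_throughputs)
    return t_c / t_t

def equivalence_ratio(tfrcp_throughputs, tcp_throughputs):
    """E: the larger of F and 1/F, so E is never below 1."""
    f = friendliness_ratio(tfrcp_throughputs, tcp_throughputs)
    return max(f, 1.0 / f)

# Assumed sample data: TCP throughput fixed at 20, TFRCP throughput falling.
tcp = [20.0, 20.0]
for tfrcp in ([30.0, 30.0], [20.0, 20.0], [10.0, 10.0]):
    print(friendliness_ratio(tfrcp, tcp), equivalence_ratio(tfrcp, tcp))
```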

C. Simulation Scenarios 

We now present results of performance evaluation 
of TFRCP protocol in various simulation scenarios. 

C.1 Long duration flows with constant bottleneck
bandwidth 

In this scenario we consider traffic made up entirely 
of equal numbers of infinite-duration TCP connec- 
tions and infinite-duration TFRCP connections. All 
connections always have data to send. All connec- 
tions start at the beginning of simulation and last 
until the simulation ends. The aim here is to study 
steady state behavior of TFRCP protocol. If TFRCP 
performs well (i.e., in a TCP-friendly manner), the
TCP and TFRCP connections should see approxi- 
mately the same throughput. 

We vary the total number of flows in the net- 
work between 10 and 50. Half of these connec- 
tions are TCP connections and the rest are TFRCP 
connections. The initial sending rate, $r_0$, for all
TFRCP connections was set to approximately 40 
packets/second. The bottleneck bandwidth is held 
constant at 1.5Mbps, and the bottleneck delay is set 
to 50ms. This roughly simulates a situation in which 
a number of connections share a T1 link. As the
number of flows grows, the window sizes of individual TCP
connections shrink, increasing the probability of time- 
outs. In such circumstances, the congestion control 
protocols proposed in [17, 20] are not able to
guarantee fairness.

We consider three different ways to determine how 
frequently TFRCP should recompute its rate: 



• Fixed recomputation interval, i.e., we use a fixed
value for M. We call this strategy S1.
• The recomputation interval is a fixed multiple of the
round trip time. If at the beginning of round i the
round trip time is $rtt_i$, then the next recomputation
is performed after $K \cdot rtt_i$ time units, where K is a
constant. We call this strategy S2.
• The recomputation interval is calculated at the
beginning of each round, and is set to the sum of two
numbers, one of which is a constant while the other is
chosen from a uniform random distribution. This
strategy will further prevent TFRCP connections from
synchronizing with each other. We call this strategy
S3.
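
The three strategies can be written as small Python functions, as in the sketch below. The function and parameter names are assumptions for illustration; for S3, the random part is drawn uniformly from [0, spread].

```python
import random

def interval_s1(m):
    """S1: a fixed recomputation interval M."""
    return m

def interval_s2(k, rtt):
    """S2: a fixed multiple K of the round trip time at the start of the round."""
    return k * rtt

def interval_s3(constant, spread, rng=random):
    """S3: a constant plus a value drawn uniformly from [0, spread]."""
    return constant + rng.uniform(0.0, spread)

# Illustrative values (assumed): M = 3 s, K = 10 with a 100 ms round trip
# time, and a constant of 1.5 s plus a uniform draw from [0, 3] s.
print(interval_s1(3.0))
print(interval_s2(10, 0.1))
print(interval_s3(1.5, 3.0, random.Random(0)))
```

Under S2 the interval lengthens whenever the round trip time grows, so a sender using it recomputes its rate less often as congestion builds.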

In Figure 2(a) we present simulation results for the 
case in which the TFRCP protocol uses strategy S1,
with five values of M between 2 and 5 seconds. The 
length of each simulation was 1000 seconds, and the 
throughput of all connections was measured at the 
end of the simulation. Each data point is an aver- 
age of three experiments. It can be seen that with 
steady state background traffic, the protocol is able 
to maintain a friendliness ratio close to 1.

In Figure 2(b) we present simulation results when 
TFRCP protocol uses strategy S2, with four values 
of K between 10 and 60. We notice that as the load 
on the network increases, the resulting TFRCP be- 
havior is more aggressive than TCP. As the load on 
the network increases, the round trip time experienced 
by each flow also increases. As a result, each TFRCP 
flow re-computes its rate less frequently. TCP reduces
its transmission rate multiplicatively every time it
encounters a loss, and increases it only additively when
there is no loss. The slower reaction of TFRCP
flows to losses therefore hurts the throughput of TCP
connections. TFRCP is thus more aggressive, and
S2 is clearly not an appropriate strategy for deciding
recomputation intervals.

In Figure 2(c) we present simulation results where 
TFRCP protocol uses strategy S3. For each line we 
use a different constant and a different uniform ran- 
dom distribution: 0.3 + [0,5.4], 1.5 + [0,3] and 2.7 + 
[0,0.6]. For this simulation study, all TFRCP connec- 
tions were started simultaneously. It can be seen that 
in this third case the protocol is able to maintain a 
friendliness ratio close to 1. 

We have performed simulations with other bottle- 
neck delays and observed similar results. In the rest 
of this section we only present results using strategy 
S1. We do this for two reasons. First, strategy S1
is the simplest strategy. The goal of this paper is to
present the TFRCP protocol as a baseline policy; use of
a simple policy to decide the recomputation interval 
is consistent with that goal. Second, the question of 
selecting the appropriate recomputation interval re- 
quires more complex answers than the three simple 
strategies described here. The recomputation interval 
must be short enough to allow TFRCP to be respon- 
sive, while at the same time it must be large enough 
to allow the loss rate measurements to be meaningful. 
This question is currently under research [5, 6]. Thus,
it is appropriate to restrict the baseline protocol de- 
scribed here to the simplest strategy. 

Recall that the TFRCP connections should be fair 
to each other as well. In Figure 3 we plot the value
of FC when the TFRCP protocol uses strategy S1.
It can be seen that the TFRCP protocol achieves
acceptable fairness among TFRCP connections in most
cases.

C.2 Long duration flows with constant bottleneck 
bandwidth share 

In this scenario, the traffic is made up of infinite- 
duration TCP connections and infinite-duration 
TFRCP connections. All connections start at the be- 
ginning of the simulation and last until the end. We 
vary the total number of flows in the network between 
10 and 50. The bottleneck bandwidth is computed by 
multiplying the total number of flows by 4Kbps. The 
buffer size at the bottleneck link was set in each case 
to four times the bandwidth-delay product. These set- 
tings of packet and buffer sizes allow the TCP connec- 
tions to have "reasonable" window sizes [17] and ex- 
hibit the full range of behavior such as slow start and 
congestion avoidance. Each experiment is repeated
for various values of the recomputation interval, M.
The initial sending rate, $r_0$, for all TFRCP
connections was set to approximately 40 packets/second.
In Figures 4(a) and 4(b) we plot F and FC (TCP- 
friendliness and Fairness among TFRCP connections) 
for this scenario when the bottleneck delay was 50ms 
and the TFRCP connections used strategy S1. It
can be seen that with steady state background traf- 
fic, the protocol is able to maintain a friendliness ra- 
tio close to 1, and the TFRCP connections are fair 
among themselves as well. We performed simula- 
tions with bottleneck delay of 20ms and 100ms as well 
(not shown here), and found that for small bottle- 
neck delays, TFRCP behaves more aggressively than 
TCP. We conjecture that this is due to the fact that 
with lower round trip times, TCP reacts to losses and 
small changes in traffic fluctuations more quickly. At 
higher round trip delays (100ms) the performance of 
the TFRCP protocol for small values of M (< 3 sec- 
onds) shows high variance, as the protocol is unable 
to gather sufficient samples to estimate loss rates ac- 
curately. 

C.3 Dynamically Arriving Medium-duration FTP 
Connections 

In this simulation scenario, we study the effect of 
"slow" changes in the background traffic. Recall that 
in the simulations described so far, traffic consisted 
of infinite TCP and TFRCP connections. We now 
consider the case in which there is one infinite-duration
TCP connection, one infinite-duration TFRCP
connection, and additional traffic consisting of
dynamically arriving TCP connections, each of which
transfers a fixed amount of data. In computing F, we
consider only the two infinite-duration connections. The
bottleneck link bandwidth is set to 1.5Mbps and the
bottleneck delay is set to 50ms. The duration of the
simulation is 1000 seconds. The amount of data
transferred by each background connection is chosen from
a uniform distribution. The interarrival times for the 
medium-duration FTP connections are chosen such 
that on average a constant number of background con- 
nections will be active. A higher average number of 
background connections leads to more fluctuations in 
the background traffic, and in addition, the window 
size of each TCP connection tends to be smaller (due 
to a smaller bandwidth share), increasing the
possibility of timeouts. We are interested in the
performance of the TFRCP protocol as the average number of
background connections changes. For the graphs in
Figures 5(a) and 5(b), the data transferred by each
connection is chosen from [0, 80KB] (average 40KB) and
[0, 160KB] (average 80KB), respectively.

The results in Figure 5 show that TFRCP main- 
tains a friendliness ratio of approximately one with 
a recomputation interval M = 2 seconds. The
ratio decreases as the recomputation interval becomes
larger. We conjecture that this behavior is due to the 
nature of the background traffic. As old connections 
terminate and new ones start, there are small periods 
of time during which the background traffic decreases 
slightly as the new connections go through their slow 
start phase. TCP is better able to take advantage of 
these small drops in the background traffic, due to its 
faster feedback mechanism. The TFRCP connection 
changes its sending rate only every M seconds, and 
hence is unable to take advantage of short-term drops 
in the background traffic. 

C.4 ON/OFF UDP traffic 

In this simulation scenario, we model the effects 
of competing web-like traffic (very small TCP con- 
nections, some UDP flows). It has been reported in 
[15] that WWW-related traffic tends to be self-similar 
in nature. In [25], it is shown that self-similar traf- 
fic may be created by using several ON/OFF UDP 
sources whose ON/OFF times are drawn from
heavy-tailed distributions such as the Pareto distribution.
Figure 6 presents results from simulations in which 
the "shape" parameter of the Pareto distribution is 
set to 1.2. The mean ON time is 1 second and the 
mean OFF time is 2 seconds. During ON times the 
sources transmit with a rate of 12Kbps. The number 
of simultaneous connections is varied between 20 and 
80. The simulation was run for 25000 seconds. As in 
the previous subsection, there are two monitored
connections, an infinite TCP connection and an infinite
TFRCP connection (i.e., k = 2). The bottleneck link
bandwidth is set to 1.5Mbps and the bottleneck delay 
is set to 50ms. From the results in Figure 6, we can see 
that the TFRCP protocol is still relatively fair. The 
Fairness index again decreases as the recomputation 
interval, M, increases. We believe that this is due to 
the fact that the TFRCP connection recomputes its 
rate only after every M time units. Hence, it cannot
increase its sending rate during the small periods of
time in which the background traffic drops in
intensity. Results for other values of the shape parameter
are similar.
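
The ON/OFF source model used in this scenario can be sketched in Python as follows. This is a sketch under assumptions: it uses the standard Pareto distribution with minimum value (scale) x_m, whose mean is shape * x_m / (shape - 1) for shape > 1, and the function names are illustrative. The text does not state which parametrization the simulator uses.

```python
import random

def pareto_scale(mean, shape):
    """Scale x_m that gives a Pareto distribution the requested mean."""
    if shape <= 1:
        raise ValueError("the mean is finite only for shape > 1")
    return mean * (shape - 1) / shape

def on_off_periods(n, mean_on=1.0, mean_off=2.0, shape=1.2, seed=None):
    """Draw n (ON duration, OFF duration) pairs, in seconds."""
    rng = random.Random(seed)
    on_scale = pareto_scale(mean_on, shape)
    off_scale = pareto_scale(mean_off, shape)
    # random.paretovariate returns samples with minimum value 1,
    # so multiplying by the scale shifts the minimum to x_m.
    return [(on_scale * rng.paretovariate(shape),
             off_scale * rng.paretovariate(shape))
            for _ in range(n)]

for on_time, off_time in on_off_periods(3, seed=1):
    print(f"ON {on_time:.3f} s, OFF {off_time:.3f} s")
```

With a shape of 1.2 the variance is infinite, so sample averages over short runs can differ noticeably from the nominal means of 1 and 2 seconds.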

D. Summary of simulation results 

The simulation results presented in this section
show that the TFRCP protocol is "TCP-friendly"
under a wide variety of traffic conditions. We found that
using a fixed value for the recomputation
interval (M) works well for a wide variety of traffic
conditions. While the simulation study is based on
several different traffic scenarios, it is important to
observe the performance of the protocol in the real world.
In the next section we discuss the implementation and 
present results based on experiments carried out over 
the Internet. 

IV. Implementation and Experimental 
Results 

As noted in [16], simulating an Internet-like envi- 
ronment is very difficult. It is thus essential to test 
protocols like TFRCP via implementation and
experimentation in a real-world setting. Our goal here is
thus to show that the approach is practical, and that 
performance of the protocol under real-world condi- 
tions is comparable to that observed in the simula- 
tions. We have implemented a prototype version of 
our protocol and tested it on several Unix systems. 
In this section, we first describe the implementation, 
and discuss some of the difficulties encountered. We 
then present the results from over 300 experiments 
performed using this implementation. 

A. Implementation 

Our implementation of TFRCP is done in user 
space, on top of UDP. The sender side of TFRCP 
runs as two processes, one sending the data and the 
other receiving ACKs. The two processes communi- 
cate via shared memory. An earlier attempt to imple- 
ment the protocol using multiple threads failed, as the 
p-threads package could not provide sufficiently accu- 
rate scheduling control to avoid starving either the 
sender or the receiver thread. We were able to reuse 
much of the ns simulation code for the actual imple- 
mentation. However, we encountered three important 
problems during the implementation: 
• A significant problem in any actual implementation 
is the accuracy of the various timers involved. For
simulation purposes, we could time out packets with 
arbitrary precision. This is not possible in an actual 
implementation, as the timers are neither arbitrar- 
ily accurate nor are they free of overheads. In some 
of our early experiments we found that when using 
the gettimeof day and select system calls, we could 
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not control inter-packet interval more accurately than 
within several milliseconds. While we could achieve 
better accuracy using busy waiting in the process that 
sent the packets out, this can possibly starve the pro- 
cess that receives ACKs. On a FreeBSD machine 
used in this study, busy waiting caused other prob- 
lems that forced us to use gettimeof day and select 
system call in our FreeBSD implementation. Due to 
these difficulties, it is not possible to clock packets out 
smoothly over the duration of each rounds as men- 
tioned in Section II-B. Instead, we send packets out 
in small bursts. Consider round i. Let R be the round 
trip time estimate at the beginning of this round, and 
r; be the sending rate. The duration of the round is 
M time units. Then, the round is divided into bursts 
of duration R each. The number of bursts is thus. 
bi = M/R. In each burst, ni/bi (rounded to nearest 



integer) packets are sent back-to-back, followed by a 
silence period of R time units. 

• Another important problem was the accuracy of the 
round trip times. The TFRCP protocol begins mea- 
suring the round trip time for each packet as soon 
as it is handed to the kernel socket using sendto. 
Thus, our round trip times include the time each 
packet spent waiting in the kernel buffers (similarly 
for ACKs). Thus, our estimate of round trip times is 
higher than that of the in-kernel TCP's. In addition, 
due to additional difficulties with timers, we had to 
restrict the protocol to transmit at least one packet 
per round trip time. 

• In our simulation studies, it was easy to ensure that 
the packet sizes for TCP and TFRCP connections 
were the same. It is more complicated to ensure this 
in practice. Our implementation currently does not
employ any path MTU discovery algorithm, nor does
it change the size of outgoing packets. For each experiment
described in the next section, the packet size is
held constant, determined by the MTU discovered by 
TCP in previous experiments between the same two 
hosts. While we have found that we seldom had prob- 
lems with the MTU value, it is hard to quantify the 
effects of constant packet size on throughput. 
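
The burst-based pacing described in the first item above can be sketched as follows. This is only an illustration, not the authors' code: the function name is hypothetical, n_i is assumed to be the number of packets owed in the round (r_i times M, since n_i is not defined in this section), and rounding the burst count b_i = M/R to an integer is an added assumption.

```python
def burst_schedule(rate, rtt, interval):
    """Split one round into bursts, following b_i = M/R and n_i/b_i.

    rate     -- sending rate r_i in packets per second
    rtt      -- round trip time estimate R in seconds
    interval -- recomputation interval M in seconds
    Returns (number of bursts, packets per burst).
    """
    # Assumption: n_i is the number of packets owed in this round.
    n_packets = rate * interval
    # b_i = M / R, rounded here so that the burst count is an integer.
    n_bursts = max(1, round(interval / rtt))
    # n_i / b_i, rounded to the nearest integer as in the text.
    per_burst = round(n_packets / n_bursts)
    return n_bursts, per_burst
```

With the values quoted later in the paper (40 packets/second, M = 3 seconds) and a 100 ms round trip time, this gives 30 bursts of 4 packets each.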

As a result of the implementation considerations 
noted above, we expect the results from implementa- 
tion experiments to differ somewhat from the simula- 
tion experiments. However, we can still use the imple- 
mentation to corroborate the intuition gained through 
our simulations, to examine TFRCP performance in 
real-world setting, and to provide a starting point for 
a more refined implementation. 

B. Experimental Results

The hostnames, domains and operating systems of
the machines used for the implementation study are
listed in Table I. To measure the fairness of TFRCP
compared to TCP, we performed the following experiment.
We established two connections between a pair
of hosts. One of these connections was controlled by
the TFRCP protocol, while the other was controlled
by the TCP protocol. Both connections ran simultaneously,
and transferred data for 1000 seconds, as fast
as possible. The length of the recomputation interval,
M, for the TFRCP connection was set to 3 seconds;
the receiver's declared window size, W_max, was set to
100 packets; and the initial sending rate, r_0, was set
to approximately 40 packets/second. The throughput
of the two connections was measured in terms of num- 
ber of packets transferred in these 1000 seconds. Let 
us denote these throughputs Tc and TV respectively. 

Figures 7(a)-7(c) show the results when the senders
were void, manic and bmt respectively. Since we are
not interested in trends along the x-axis, we use E
as our performance metric. For each of the first five
bars, the x-axis shows the receiver. To plot this graph,
at least 15 experiments were performed between the
sender and the receiver at random times during the
day and night 2, and for each experiment the value
of E was computed. It is suggested in [8] that data
from such experiments should be represented by its
median, and that the variation be represented by the
semi-interquartile range (SIQR), defined as half of the
difference between the 75th and the 25th percentiles of
the data set. Thus, the height of each bar is the median
of that data set, while the error bar represents the
SIQR, centered about the median. The last bar represents
the median and the SIQR of all experiments.
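
The summary statistics used for each bar can be computed as in the sketch below. The percentile helper uses linear interpolation between order statistics, which is an assumption: the paper does not say which percentile convention was used, and other conventions give slightly different values on small samples.

```python
def percentile(sorted_vals, p):
    """p-th percentile (0..100) by linear interpolation (assumed convention)."""
    pos = (len(sorted_vals) - 1) * p / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def median_and_siqr(values):
    """Return (median, SIQR), with SIQR = (75th - 25th percentile) / 2."""
    vals = sorted(values)
    med = percentile(vals, 50)
    siqr = (percentile(vals, 75) - percentile(vals, 25)) / 2
    return med, siqr
```

Each bar in Figures 7(a)-7(c) would correspond to one call of `median_and_siqr` on the per-experiment values of E for a sender-receiver pair.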

It can be seen that in most cases, the TFRCP pro- 
tocol achieves a throughput that is within 35-50% of 
the TCP throughput and that the difference seldom 
exceeds 75%. The median of all three data sets taken 
together is 1.448 and the SIQR is 0.275. There are 
many possible reasons for the observed difference be- 
tween the TCP and TFRCP throughputs. Some vari- 
ation is unavoidable - we have found that the through- 
put of two simultaneous TCP connections between the 
same hosts can differ by as much as 10%. Additional 
variation results from the various implementation dif- 
ficulties described earlier. And finally, one must re- 
member that the formula described in [13] is only an 
approximation. 

Figures 7(a)-7(c) are based on throughputs that 
have been computed over the entire duration of the 
experiment (i.e., 1000 seconds). It is also interesting 
to compare the difference in TCP and TFRCP as a 
function of time, and over shorter intervals of time. 
Such a comparison illustrates how well the TFRCP 
protocol performs at various time scales. In Figure 8,
we plot the throughput of the TFRCP and the TCP
connections between manic and edgar, measured every
6, 12, 24 and 48 seconds respectively. It can be
seen that TFRCP tracks variations in throughput of 
the TCP connection quite well, at various time scales. 

To measure the sensitivity of the protocol to the
interval over which we measure the loss rate and
update the sending rate (i.e., the value of M), we
performed several data transfers between the same
sender-receiver pair, using different measurement
intervals. We now use F as our fairness metric, as we
are interested in the trend in the performance metric
as we vary M. In Figure 9 we show the results of one
such study, performed between void and alps. The
measurement interval was varied between 2
and 10 seconds. For each value of the measurement
interval, at least 20 experiments were conducted at
random times. The data points are the medians of the
throughput ratios, and the error bars represent the
SIQR. One can see that as the measurement interval
grows larger, the TFRCP protocol becomes less
aggressive. This result is consistent with the simulation
results presented in the previous section. From the
results presented in this section, we can conclude that
the protocol indeed performs well in a real-world
setting, despite the limitations and difficulties imposed
by various implementation issues.

2 Experiments with bmt as a sender were performed only
during the day.



V. Discussion of Protocol Features 

In this section we discuss the impact of some of the 
design choices made while simulating and implement- 
ing TFRCP. 

Recall that TFRCP doubles its sending rate when 
no packets are lost in an entire recomputation period, 
since the formula in (2) is not valid for zero loss rate. 
During periods of no loss, the TCP window grows lin- 
early (ignoring the initial slow start period), by one 
every RTT. Since the sending rate is proportional to 
the window size, one can say that the sending rate
of TCP grows linearly during periods of no loss. We
found that when we tried to mimic this linear increase
behavior in TFRCP, the protocol performed poorly
(i.e., the friendliness ratio was higher). This is due to
the fact that in most of our simulations and Internet
experiments, the fair share of the TFRCP connection
tended to be relatively small. If, after a recomputation
interval of no loss, we increased the rate in a linear
fashion, the relative change in the rate was very high
(e.g., if no loss occurred during a three-second period,
and if RTT was 100ms, the rate would increase by
3/0.1 = 30). This led to very high losses in the next
round, which in turn dropped the sending rate to a 
very low value, leading again to a no-loss, or low-loss, 
period. This oscillatory behavior was detrimental to 
the performance of the protocol. Doubling the send- 
ing rate seems to offer a good compromise between 
responsiveness (ramping up the sending rate quickly) 
and avoiding oscillatory behavior. 
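
The arithmetic behind this argument can be made concrete with a small sketch. The function names and the starting rate of 5 are hypothetical, chosen only to represent a "relatively small" fair share; the rate is expressed in the same units as the 3/0.1 = 30 increment in the text.

```python
def linear_increase(rate, interval, rtt):
    """TCP-like additive increase: one unit per RTT over a loss-free interval."""
    return rate + interval / rtt


def doubling(rate):
    """The rule TFRCP uses after a loss-free recomputation interval."""
    return 2 * rate


def relative_change(old, new):
    """Fractional change in rate from old to new."""
    return (new - old) / old
```

Starting from a rate of 5 with M = 3 seconds and RTT = 100 ms, the linear rule jumps to about 35 (a sixfold relative increase), whereas doubling yields 10. This is the kind of overshoot that the text identifies as the cause of the oscillation.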

The value of W_max can significantly affect the
throughput computed using the formula in (2) at low
loss rates. While in the simulation studies it is easy
to ensure that competing TCP and TFRCP flows have
the same value for W_max, this is hard to ensure in
practice. We note that this problem is inevitable
whenever flow control is employed: two TCP connections
experiencing the same network conditions, but
having different values for W_max, will have different
throughputs.

Another design issue is how to set the initial value
for r_0. For the simulation and implementation results
reported in this paper, we set this value to approximately
40 packets/second. As long as the recomputation
interval M was small compared to the time
over which the friendliness or equivalence was being
measured, the value for r_0 had little impact on the
performance of TFRCP.

As mentioned in Section IV, timer inaccuracies and
overheads force us to send packets out in small bursts,
instead of clocking them out evenly over the duration
of each round. The impact of this burstiness on the
performance of the TFRCP protocol is hard to quantify. On
one hand, one may imagine that burstiness would lead
to slightly higher loss rates for the TFRCP connection,
forcing the throughput down. On the other hand,
traffic from a TCP flow is somewhat bursty as well
[3]. Thus the impact of the bursty nature of a TFRCP flow
on the friendliness ratio is hard to judge.

It should be noted that the formula in (2) is not 
valid for certain network scenarios, such as TCP con- 
nections running over modem lines with large dedi- 
cated buffers [13]. This implies that the TFRCP pro- 
tocol would not work well in these situations either. 
We are currently working on solutions to this prob- 
lem. We also note that TFRCP reacts to changes in 
network conditions only every M time units (i.e. the 
duration of recomputation interval). If the network 
traffic conditions change on a faster time scale, the 
difference between the throughput of a TCP connec- 
tion and a TFRCP connection experiencing similar 



network conditions may be significant. Under such 
dynamic conditions, obtaining accurate loss estimates 
and round trip times can be problematic. We note 
that in real-world testing, Figures 7(a)-7(c), we have
found that the protocol works well with a recomputation
interval of three seconds. One may question if
achieving TCP-friendliness at large time granularities
is useful at all. We would like to point out that mul- 
tiple TCP connections going over the same network 
path need not achieve same throughput on a time 
scale comparable to the round trip time. Thus, fair- 
ness needs to be measured over time intervals longer 
than a few round trip times. One must also note that 
very short TCP connections such as HTTP transfers, 
do not achieve friendliness even among themselves. 
Hence, we have restricted ourselves to achieving fair- 
ness between long term TCP and TFRCP connec- 
tions. We believe that as long as the duration of a 
flow is significantly larger than M, the TFRCP pro- 
tocol achieves this goal. 

VI. Conclusions and Future Work 

In this paper we have presented a TCP-friendly rate
adjustment protocol. The protocol achieves
TCP-friendliness by changing its sending rate, based on
the TCP characterization developed in [13], using the
measured loss rate and round trip times. In addition to
studying the protocol through simulations, we implemented
a prototype version of the protocol and tested
it with experiments over the Internet. The results
of both simulation and implementation experiments
show that the protocol is able to achieve throughputs
that are close to the throughput of a TCP connection
traveling over the same network path. Thus, we
conclude that the formula-based feedback-loop approach
to congestion control and achieving TCP-friendliness
is indeed practical.

We have identified several avenues for future work. 
We are currently working on developing better tech- 
niques for loss rate estimation. We plan to refine the 
implementation of the protocol, especially the imple- 
mentation of various timers. We also plan to investi- 
gate if any other throughput formulas can be used in 
the feedback loop, and their impact on performance 
of the protocol. Above all, we are working towards 
developing a comprehensive protocol for congestion 
control of continuous media flows. The protocol will 
take into account the effects of limited buffer space 
available at the sender and the receiver, along with 
the timeliness requirements and loss tolerance of the 
specific media being sent.

Abstract: End-to-end QoS control over best-effort and differentiated
service networks which exhibit variability in their exported service
properties looms as an important challenge. In previous work, we have shown
how packet-level adaptive FEC can be used in dynamic networks to facilitate
invariant user-specified QoS in an end-to-end manner.

This paper addresses two important problems (self-similar burstiness
and performance degradation of reactive controls subject to long feedback
loops), complementing the stability/optimality considerations studied
earlier. First, for adaptive redundancy control to be effective, its
susceptibility to correlated packet drops and queueing delays stemming from
self-similar burstiness must be fortified. Second, to preserve FEC's viability
over ARQ when transporting real-time traffic in WANs, proactivity must
be injected to offset the performance degradation of reactive feedback
controls when subject to long RTTs.

In this paper, we use the recently advanced multiple time scale congestion
control framework, first investigated in the throughput maximization
context, to endow adaptive redundancy control with both selective
protection against self-similar burstiness as well as proactivity to
feedback redundancy control. We analyze, implement, and benchmark our
protocol, AFEC-MT, in the context of transporting periodic real-time
traffic, in particular, MPEG video.

I. Introduction 

A. Background 

Forward error correction (FEC) is a well-studied reliable
communication technique which has been successfully used,
primarily at the bit-level, in a number of application domains
from space communication to reliable data storage on compact
disks [4], [14], [15]. In the context of supporting multimedia
traffic with real-time constraints over high-speed wide-area
networks, packet-level FEC has received interest due to ARQ's
inherent limitation in handling timing constraints when subject
to long end-to-end latencies [2], [5], [6], [12], [19], [23], [28].

Packet-level FEC introduces further complexities due to
correlated packet drops (or erasures) and delays stemming from
queueing, which is especially severe under self-similar bursty
traffic conditions. Cidon et al. [8], [9] have studied the impact
of correlated packet drops on packet-level FEC performance
and shown that their impact can be significant. Queueing
analysis with Poisson input is provided in [1]. Empirical
evidence [5], [23] indicates that performance degradation is
further amplified under self-similar traffic conditions.

[...] grants from PRF and Sprint.

When applying packet-level FEC for real-time data transport
in shared dynamic networks, it is imperative that appropriate
redundancy or overcode, commensurate with network
state and desired target QoS, is applied such that bandwidth is
not unnecessarily wasted. In previous work [22], [23], [26], we
proposed an adaptive packet-level FEC protocol called AFEC
and analyzed its properties with respect to optimality and
stability. The control problem is nontrivial due to the fact that
increased redundancy, beyond a certain point, can "backfire,"
resulting in self-induced congestion which impedes the timely
recovery of information at the receiver. We implemented and
tested AFEC in high-speed LAN environments when transporting
real-time MPEG video and showed that end-to-end QoS
provisioning could be facilitated using adaptive redundancy
control.

B. Problem Statement 

The limitations of our previous work [22], [23] are two-fold: 
one, AFEC's adaptive redundancy control was geared toward 
protecting against generic forms of burstiness without special 
sensitivity to self-similar burstiness [16] thus leaving room 
for possible improvement, and two, AFEC — being a feedback 
control— suffered under the problem of long round trip laten- 
cies intrinsic to all reactive controls which reduced its effec- 
tiveness vis-a-vis ARQ in WAN environments. 

We remark that these two problems, although studied in
the specific context of adaptive redundancy control for
real-time data transport, are also relevant to other forms of
end-to-end QoS control over networks exporting variable services [7],
[10], [20] where end-to-end control can be used to amplify and 
endow robustness to the QoS experienced by an application. 

C. New Contributions

In this paper we show that the aforementioned problems
(self-similar burstiness and feedback redundancy control with
long RTTs) can be effectively addressed, yielding significant
performance improvements. Our solution is based on
the framework of Multiple Time Scale Congestion Control
(MTSC) [29], [30], which has been recently advanced in the
[...] for congestion control purposes [...] time scale of the
feedback loop. In a nutshell, when the contention level at large
time scales is [...] "low," the bandwidth consumption behavior
of the underlying feedback congestion control is made more
aggressive, and vice versa.

[...] sensitive transport of real-time traffic, the main technical
challenge [...] time scale module which is then coupled to AFEC
such that the collective behavior facilitates both selective
protection against self-similar burstiness and proactivity to
counteract the reactive nature of AFEC. The specific form of
coupling in the [...] AFEC-MT is additive, where the amount of
redundancy [...] instant is composed of two parts: [...] component
acting at the time scale of the feedback loop governed by AFEC,
and the large time scale component $h_L$. The latter behaves as a
"DC component" over the short time scale, incurring level shifts at
large time scales which reflect the overall contention level and
corresponding redundancy needed to achieve a target end-to-end
QoS. Whereas $h_S$ is updated using implicit prediction afforded
by feedback control, $h_L$ is computed using explicit prediction.

We give a qualitative analysis of AFEC-MT with respect to
its stability and optimality [...]. [...] practical efficacy of the
protocol by implementing and benchmarking AFEC-MT when
transporting real-time MPEG video over [...] end-to-end latency
are systematically injected and [...] evaluated. Of particular
interest is the [...] of reactive controls when subject to [...]
show that significant performance gains are possible by
engaging [...].

[...] Section IV discusses AFEC-MT [...] time scale redundancy
control, and Section V [...] performance results from
implementation-based experiments for real-time MPEG video transport.



[...] degree of redundancy. We will assume that the receipt of any $k$
packets out of the $n$ total packets suffices to recover the original
$k$ data packets. FEC encoding/decoding functions with this
property include Reed-Solomon [18] and IDA [...]. Figure II.1
gives a depiction of packet-level FEC in an end-to-end [...]
environment.

Fig. II.1. A block of $k$ packets encoded at the sender using FEC as $k + h$
packets. If the number of dropped or untimely packets does not exceed $h$,
the original $k$ data packets are recovered at the receiver.



II. Overview of AFEC

[...] frame, are encoded as $n = k + h$ packets.



The application in question is real-time constrained in the
following sense. The sender transmits $n(t_1), n(t_2), \ldots$
($t_i < t_j$ if $i < j$) blocks of packets at times $t_i$ where
$n(t_i) = k(t_i) + h(t_i)$, $i = 1, 2, \ldots$. That is, $k(t_i)$ is the
traffic [...] by the application and $h(t_i)$ is
the corresponding redundancy factor. We model QoS requirements
using [...] real-time constraints whereby we assume
existence of a monotonically increasing sequence [...]
at the receiver such that all $k(t_i)$ data packets belonging
to the $i$'th block must be recovered before time $t'_i$. For
example, given a frame rate of 30fps, successive frames [...]
decoding overhead at the receiving end station. QoS [...]
at the receiver using a recovery rate process $\gamma(t'_i)$,
defined to be the number of packets belonging to block $i$ [...]
before time $t'_i$. We will say there is a hit at time $t'_i$ if
[...] decoding of the $i$'th frame was timely and
successful.


B. Adaptive Redundancy Control 

[...] increasing $h(t)$ [...] will adversely affect $\gamma(t + \tau)$ for
some $\tau > 0$. That is, [...] unimodal relationship between $h(t)$
and $\gamma(t + \tau)$ [...] positive hit rate. Figure II.2 depicts
the unimodal redundancy-recovery rate function.

Fig. II.2. Unimodal redundancy-recovery rate function $\gamma = G(h)$ with
maximum recovery rate $\gamma^*$ and target recovery rate $\gamma_*$.

Consider the case when [...] receiver, we can formulate the following
control law

$$\frac{dh}{dt} = \epsilon\,\bigl(\gamma_* - \gamma(t - \tau)\bigr) \qquad \text{(II.1)}$$

where $\gamma_*$, $k < \gamma_* < \gamma^*$, is the target recovery rate, $\epsilon > 0$
is an adjustment parameter, and $\tau > 0$ is a delay term introduced
by feedback and network latency. The control algorithm
as embodied by (II.1) just says that if the measured recovery
rate $\gamma(t - \tau)$ is smaller than the target recovery rate $\gamma_*$, then
the redundancy factor $h$ should be increased, and vice versa.

Instability ensues when "too much" redundancy is applied at
the sender, which is tantamount to shooting oneself in the foot.
Let

$$H_L = \{(h, \gamma) : \gamma = G(h),\ h < h^*\},$$

$$H_R = \{(h, \gamma) : \gamma = G(h),\ h > h^*\}.$$

It can be shown [22], [25] that target operating points belonging
to $(h_*, \gamma_*) \in H_L$ are asymptotically stable whereas those
belonging to $(h_*, \gamma_*) \in H_R$ are unstable. To achieve stability, we
augment the control law given in (II.1) via a directional check
given by the sign of $d\gamma/dh$. If $d\gamma/dh > 0$, then the system
is in the stable region ($h < h^*$) and (II.1) is applied as usual.
If, on the other hand, $d\gamma/dh < 0$, then the system finds itself
in the unstable region ($h > h^*$) and a backoff mechanism,
$dh/dt < 0$, is instituted until $d\gamma/dh > 0$, at which time we
find ourselves again in the stable regime. Note that $d\gamma/dh > 0$ [...]

The augmented AFEC algorithm containing both the symmetric
and asymmetric control components is given by

$$\frac{dh}{dt} = \begin{cases} \epsilon\,\bigl(\gamma_* - \gamma(t - \tau)\bigr), & \text{if } d\gamma(t - \tau)/dh > 0, \\ -a\,h, & \text{otherwise.} \end{cases}$$

Here $a > 0$ is a positive constant. Hence the augmented control
follows (II.1), as it should, when $h < h^*$ (i.e., $(h, \gamma) \in H_L$),
and it performs drastic, asymmetric backoff only when $h > h^*$
($(h, \gamma) \in H_R$). The backoff mechanism, $dh/dt = -ah$, is
exponential, leading to a decay of $h$ of the form $e^{-at}$.
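
A discrete-time reading of the augmented control can be sketched as a single Euler step. This is an illustration under stated assumptions, not the authors' implementation: the function name, the step size `dt`, the clamp at zero, and the idea that the caller supplies an estimate of the sign of the slope of the recovery rate with respect to redundancy are all hypothetical.

```python
def afec_step(h, gamma_meas, slope_sign, gamma_target, eps, a, dt):
    """One Euler step of the augmented AFEC control law.

    h            -- current redundancy level
    gamma_meas   -- delayed recovery-rate measurement, gamma(t - tau)
    slope_sign   -- estimated sign of d(gamma)/dh (positive: stable region)
    gamma_target -- target recovery rate gamma_*
    eps, a       -- positive adjustment and backoff constants
    dt           -- step size (assumed; the source states the law in continuous time)
    """
    if slope_sign > 0:
        # Symmetric control (II.1): move h toward the target recovery rate.
        dh = eps * (gamma_target - gamma_meas)
    else:
        # Asymmetric backoff: exponential decay of h.
        dh = -a * h
    # Redundancy cannot be negative (added safeguard, not in the source).
    return max(0.0, h + dt * dh)
```

In the stable region the step raises redundancy when the measured recovery rate is below target; in the unstable region it ignores the error term and shrinks redundancy geometrically.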

III. Overview of MTSC 

A. Self Similar Burstiness 

Let $\{X_t;\ t \in \mathbb{Z}_+\}$ be a time series which represents the trace
of data traffic measured at some fixed time granularity. We
define the aggregated series $X_i^{(m)}$ as

$$X_i^{(m)} = \frac{1}{m}\,\bigl(X_{im-m+1} + \cdots + X_{im}\bigr).$$

That is, $X_t$ is partitioned into blocks of size $m$, their values
are averaged, and $i$ is used to index these blocks. Let $r(k)$ and
$r^{(m)}(k)$ denote the autocorrelation functions of $X_t$ and $X_i^{(m)}$,
respectively. Assume $X_t$ has finite mean and variance. $X_t$
is asymptotically second-order self-similar with parameter $H$
($1/2 < H < 1$) if for all $k \ge 1$,

$$r^{(m)}(k) \sim \frac{1}{2}\Bigl((k+1)^{2H} - 2k^{2H} + (k-1)^{2H}\Bigr) \qquad \text{(III.1)}$$

as $m \to \infty$. $H$ is called the Hurst parameter and its range
$1/2 < H < 1$ plays a crucial role. The significance of (III.1)
stems from the following two properties being satisfied:

(i) $r^{(m)}(k) \sim r(k)$,

(ii) $r(k) \sim c\,k^{-\beta}$,

as $k \to \infty$, where $0 < \beta < 1$ and $c > 0$ is a constant. Property
(i) states that the correlation structure is preserved with respect
to time aggregation, and it is in this second-order sense that $X_t$
is "self-similar." Property (ii) says that $r(k)$ behaves hyperbolically,
which implies $\sum_{k=0}^{\infty} r(k) = \infty$. This is referred to as
long-range dependence (LRD). The second property hinges on
the assumption that $1/2 < H < 1$, as $H = 1 - \beta/2$.
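
The aggregation step and the limiting autocorrelation in (III.1) can be illustrated with a short sketch. The function names are hypothetical, and dropping a trailing partial block is an assumption the text does not address.

```python
def aggregate(series, m):
    """Block means of size m: X_i^(m) = (X_{im-m+1} + ... + X_{im}) / m.

    A trailing partial block, if any, is dropped (assumption).
    """
    n_blocks = len(series) // m
    return [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]


def self_similar_autocorr(k, hurst):
    """Right-hand side of (III.1) for lag k >= 1 and Hurst parameter H."""
    two_h = 2 * hurst
    return 0.5 * ((k + 1) ** two_h - 2 * k ** two_h + (k - 1) ** two_h)
```

For H = 1/2 the expression vanishes at every lag, which matches the short-range dependent boundary case; for H closer to 1 the values stay positive and decay slowly with the lag.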

The relevance of asymptotic second-order self-similarity for
network traffic derives from the fact that it plays the role of
a "canonical" model where the on/off model of Willinger et
al. [31], Likhanov et al.'s source model [17], and the M/G/$\infty$
queueing model with heavy-tailed service times [11], among
others, all lead to second-order self-similarity. In general,
self-similarity and long-range dependence are not equivalent.
For example, fractional Brownian motion with $H = 1/2$ is
self-similar but it is not long-range dependent. For second-order
self-similarity, however, one implies the other and it is
for this reason that we sometimes use the terms interchangeably
within the traffic modeling context. A more comprehensive
discussion can be found in [24].

B. LRD and Predictability 

Given $X_t$ and $X_i^{(m)}$, we will be interested in estimating
$\Pr\{X_{i+1}^{(m)} \mid X_i^{(m)}\}$ for some suitable aggregation level $m > 1$.
If $X_t$ is short-range dependent, we have

$$\Pr\{X_{i+1}^{(m)} \mid X_i^{(m)}\} \approx \Pr\{X_{i+1}^{(m)}\}$$

for large $m$, whereas for long-range dependent traffic, correlation
provided by conditioning is preserved. Thus given traffic
observations $a, b > 0$ ($a \ne b$) of the "recent" past corresponding
to time scale $m$,

$$\Pr\{X_{i+1}^{(m)} \mid X_i^{(m)} = a\} \ne \Pr\{X_{i+1}^{(m)} \mid X_i^{(m)} = b\},$$

and this information may be exploited to enhance congestion
control actions undertaken at smaller time scales. We employ
a simple, easy-to-implement (both on-line and off-line)
prediction scheme to estimate $\Pr\{X_{i+1}^{(m)} \mid X_i^{(m)}\}$ based on the
observed empirical distribution. We note that optimum estimation
is a difficult problem for LRD traffic [3], and its solution is
outside the scope of this paper. Our estimation scheme provides
sufficient accuracy with respect to extracting predictability and
is computationally efficient; however, it can be substituted by
any other scheme if the latter is deemed "superior" without
affecting the conclusions of our results. To facilitate normalized
contention levels, we define a map $L : \mathbb{R}_+ \to [1, s]$, monotone
in its argument, and let $x_i^{(m)} = L(X_i^{(m)})$. Thus $x_i^{(m)} \approx 1$
is interpreted as the aggregate traffic level at time scale $m$ being
"low" and $x_i^{(m)} \approx s$ is understood as the traffic level being
"high." The process $x_i^{(m)}$ is related to the level process used
in [13] for modeling LRD traffic. We use $L_1$ and $L_2$, without
reference to the specific time index $i$, to denote consecutive
quantized traffic levels $x_i^{(m)}$, $x_{i+1}^{(m)}$.

1 That is, via its relation to fractional Brownian motion and its increment
process, fractional Gaussian noise.
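
One possible form of the quantization map is sketched below. The paper only requires the map to be monotone with values in [1, s]; the equal-width binning, the clamping of out-of-range inputs, and the parameter names `x_min` and `x_max` are assumptions made for illustration.

```python
def quantize_level(x, x_min, x_max, s):
    """Map an aggregate traffic value x to an integer level in 1..s.

    Equal-width bins over [x_min, x_max] are an assumption; any monotone
    map into [1, s] would satisfy the description in the text.
    """
    if x_max <= x_min:
        raise ValueError("x_max must exceed x_min")
    frac = (x - x_min) / (x_max - x_min)
    frac = min(max(frac, 0.0), 1.0)      # clamp out-of-range observations
    return min(s, 1 + int(frac * s))     # level 1 is "low", level s is "high"
```

Consecutive outputs of this map for blocks i and i+1 play the roles of the two consecutive quantized levels described above.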




Fig. III.1. Conditional probability densities with $L_2$ conditioned on $L_1$ for
LRD traffic (top) and SRD traffic (bottom).

Figure III.1 (top) shows the predictability structure of LRD
traffic at a time scale of 5s by plotting its 3-D conditional
probability densities. The diagonal skewness indicates that
conditioning on $L_1$ is informative with respect to predicting $L_2$.
Figure III.1 (bottom) shows the corresponding densities for
short-range dependent traffic. We observe that conditioning has
negligible influence.



C. Coupling 

Multiple time scale congestion control allows for $n$-level
time scale congestion control ($n \ge 1$) where information
extracted at $n$ separate time scales $T_1 < T_2 < \cdots < T_n$ is
cooperatively engaged to modulate the output behavior of the
feedback congestion control residing at the lowest time scale
(i.e., $n = 1$). The objective of MTSC is to improve performance
vis-a-vis the congestion control consisting of the feedback
congestion control alone. We concentrate on 2-time scale
congestion control where the "large" time scale module $C_L$,
separated by an order of magnitude from the "small" time scale
module $C_S$, is coupled to the latter to yield a new control
$C_{LS}$. For throughput maximization, for example, the coupling
takes on a multiplicative form [29]. For QoS control using
adaptive FEC, we employ additive coupling. The latter is
illustrated in Figure III.2 where a "DC" level shift is instituted
with respect to the large time scale rate, which results in an
increase of the base rate from $\lambda_1$ to [...].

Fig. III.2. Additive coupling via selective "DC" level adjustment (i.e., level
shift) between high- and low-contention periods.

IV. AFEC-MT 
A. Multiple Time Scale Redundancy Control 

AFEC-MT, in general, allows for $n$-level time scale redundancy
control for $n \ge 1$ where information extracted at $n$ separate
time scales is additively coupled to yield the level of
redundancy $h$ applied at the FEC encoder. This is depicted in
Figure IV.1.

Our design methodology is based on devising the large time
scale module at time scale $T_2$ and attaching it to the AFEC module
operating at time scale $T_1$ of the feedback loop to improve the
control actions undertaken by AFEC. Our objective is to show
that this modular extension results in a control which is able to
achieve significant performance gains relative to AFEC. In the
2-time scale redundancy control setting, the explicit prediction
component of the large time scale module $C_L$ outputs a
redundancy level $h_L$ ($= h_2$) which is then additively combined with
the redundancy level $h_S$ ($= h_1$) computed by $C_S$, i.e., AFEC.
$h = h_S + h_L$ is then passed on to the FEC encoder component
of AFEC. In specifying the control law

$$\frac{dh}{dt} = \frac{dh_S}{dt} + \frac{dh_L}{dt},$$

the control governing $h_S$ is AFEC (cf. Section II-B) and hence
needs no separate description.

Fig. IV.1. AFEC-MT framework. Dashed lines show the potential extensibility
of the framework to three or more time scales.



B. Structure of $C_L$

Whereas $dh_S/dt$ is effected at the time scale $T_1$ of AFEC's
feedback loop, $dh_L/dt$ is effected at the much larger time scale
$T_2$. This implies that the frequency of updates for $h_S$ is much
greater than that of $h_L$. Now to the description of $dh_L/dt$.

B.l Explicit Prediction 

The explicit prediction component of C_L performs per-connection
on-line estimation of the conditional probability
densities Pr{L_2 | L_1 = k}, k ∈ [1, s], following the method
outlined in Section III-B. It turns out that on-line estimation
can be accomplished using O(1) operations at every update
interval, i.e., C_L's time scale T_2. On the sender side, C_L
maintains a 2-dimensional array CondProb[·][·] of size s × (s + 1),
one row for each k ∈ [1, s]. The last column of CondProb,
CondProb[k][s+1], is used to keep track of the number of
blocks observed thus far whose traffic level maps to k. Since
Pr{L_2 = l | L_1 = k} = CondProb[k][l] / CondProb[k][s+1],
having the table CondProb means having the conditional
probability densities.
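
The bookkeeping described above can be sketched in a few lines of Python. This is an illustration only, not the authors' implementation: the class name CondProbTable and its method names are hypothetical, and padding the rows so that levels stay 1-based is a choice made for readability. Only the table shape s × (s + 1), the count column, and the ratio formula come from the text.

```python
class CondProbTable:
    """Per-connection count table of logical size s x (s + 1).

    Entry [k][l] counts how often a block at level l followed a block at
    level k; entry [k][s + 1] counts all blocks observed at level k.
    """

    def __init__(self, s):
        self.s = s
        # Row 0 and column 0 are unused padding so that levels stay 1-based.
        self.table = [[0] * (s + 2) for _ in range(s + 1)]

    def update(self, k, l):
        """Record one (L1 = k, L2 = l) observation in O(1) time."""
        self.table[k][l] += 1
        self.table[k][self.s + 1] += 1

    def prob(self, l, k):
        """Estimate Pr{L2 = l | L1 = k}; 0.0 if level k was never observed."""
        total = self.table[k][self.s + 1]
        return self.table[k][l] / total if total else 0.0


table = CondProbTable(8)
for k, l in [(3, 4), (3, 4), (3, 5), (1, 1)]:
    table.update(k, l)
print(table.prob(4, 3))  # two of the three blocks that followed level 3 were at level 4
```

Each update touches exactly two cells, which is what makes the per-interval cost constant regardless of s.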

B.2 Selective Redundancy Control 

Using the conditional probability density table CondProb,
we compute the expectation of L_2 conditioned on L_1,
l_hat = E(L_2 | L_1 = l'), l' ∈ [1, s], at the end of each L_1.
l_hat is then used to index the redundancy schedule
H : [1, s] → R_+ to yield the value h_L = H(l_hat).

What remains is the computation of the function H. Let
h^1, ..., h^s denote the function values of H. Each component
h^l is updated according to the symmetric control

$$\frac{dh^l}{dt} = \begin{cases} \nu, & \text{if } d\gamma/dh^l > 0 \text{ and } \gamma^* > \gamma, \\ -\nu, & \text{if } d\gamma/dh^l < 0 \text{ or } \gamma - \gamma^* > \theta, \end{cases}$$

where ν > 0 is an adjustment factor and θ > 0 is a threshold
parameter. The sign of dγ/dh^l can be estimated by maintaining
a history of redundancy action-QoS impact pair sequences,
one for each contention level l ∈ [1, s]. At time t = 0, the
initial values of H are set to zero. The value of h^l, l ∈ [1, s],
is updated at the end of each block L_1 whose conditional
expectation maps to l.
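
A minimal Python sketch of this step follows. It assumes the symmetric law adds ν when the hit rate is below target and more redundancy helps, and subtracts ν when more redundancy hurts or the hit rate exceeds the target by more than θ. The function names, rounding the conditional expectation to the nearest level, the fallback when no history exists, and the clamping of the schedule at zero are all assumptions of the sketch rather than details given in the text.

```python
def expected_level(cond_counts, k):
    """Return E(L2 | L1 = k) rounded to a level in [1, s].

    cond_counts[k - 1][l - 1] is the number of times level l followed level k.
    """
    row = cond_counts[k - 1]
    total = sum(row)
    if total == 0:
        return k  # no history yet: assume the current level persists
    mean = sum((idx + 1) * count for idx, count in enumerate(row)) / total
    return min(len(row), max(1, round(mean)))


def update_schedule(H, level, dgamma_dh_sign, gamma, gamma_target,
                    nu=1.0, theta=0.02):
    """Apply one symmetric update to the schedule entry for `level`.

    H is a list where H[level - 1] holds the redundancy for that level.
    """
    if dgamma_dh_sign > 0 and gamma < gamma_target:
        H[level - 1] += nu
    elif dgamma_dh_sign < 0 or gamma - gamma_target > theta:
        H[level - 1] = max(0.0, H[level - 1] - nu)  # clamp: assumption
    return H[level - 1]


s = 8
counts = [[0] * s for _ in range(s)]
counts[2][3] = 3  # level 4 followed level 3 three times
counts[2][6] = 1  # level 7 followed level 3 once
H = [0.0] * s     # schedule starts at zero, as in the text
level = expected_level(counts, 3)
update_schedule(H, level, dgamma_dh_sign=+1, gamma=0.80, gamma_target=0.92)
```

Because only the entry indexed by the predicted level is touched, each level's redundancy adapts only on blocks that were predicted to be at that level.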

V. REAL-TIME MPEG VIDEO TRANSPORT 

A. System Structure

We have built an implementation of AFEC-MT customized
for the transport of real-time MPEG video. For brevity, we
will refer to this system as AFEC-MT in the following sections.
AFEC-MT consists of a number of modules including the FEC
codec, receiver-side controller C_R, and sender-side controllers
C_S^1 and C_S^2. C_S^1 adjusts short-range redundancy h_1 by
reacting to feedback at the time scale of RTT. C_S^2 is implemented
"on top" of C_S^1 via the coupling described in Section IV and
sets the long-range redundancy level h_2. The net redundancy
h = [h_1 + h_2]^+ is then input to the FEC encoder. The system
can be configured to turn off either one of the two control
modules. Thus when C_S^2 is disabled, the system degenerates to
AFEC. On the sender side, a stream of I, P, B frames is generated
by an MPEG encoder at some frame rate f (e.g., 25-30
frames/sec). The FEC encoder applies forward error correction
on each frame producing a sequence of n = k + h packets
which are submitted to the network.
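
The sender-side arithmetic can be illustrated with a short Python sketch. It reads [x]^+ as max(0, x), which is the usual meaning of that notation, and assumes integer redundancy levels; the function name packets_per_frame is hypothetical.

```python
def packets_per_frame(k, h1, h2):
    """Return (h, n) for one frame.

    h = [h1 + h2]^+ is the net redundancy, read here as max(0, h1 + h2).
    n = k + h is the number of packets submitted to the network.
    """
    h = max(0, h1 + h2)
    return h, k + h


# A frame of k = 10 data packets with short-range and long-range components.
print(packets_per_frame(10, -2, 5))  # (3, 13)
print(packets_per_frame(10, -4, 1))  # (0, 10): the sum is clipped at zero
```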

At the receiver, upon receiving a sequence of packets belonging
to the same frame, the controller C_R checks if at least
k packets have arrived in a timely manner. If the number of
timely packets is less than k, then the packets belonging to
the current frame are discarded including those arriving
subsequently. If at least k packets arrived within their deadline,
then the first k packets (any k-subset will do) are forwarded
to the FEC decoder proper which decodes the packet stream
to recover the original k data packets constituting the sender's
frame. The I, P, or B frame is then forwarded to the MPEG
II player which applies its own decoding to produce the
uncompressed frame which is then rendered on the terminal as
part of the video stream. We employ Rabin's IDA [27] as the
FEC codec. Details of the FEC implementation can be found in
[22], [23]. C_R computes certain control information including
the number of timely arrived packets γ which is then fed back
via a control packet to the sender. All of the modules are
implemented in software, and it is completely end-to-end such that
the system can be deployed over any IP network. The forward
flow structure is shown in Figure V.1.

Fig. V.1. System structure of AFEC-MT.
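
The receiver-side check can be sketched as below. This is a simplified Python illustration under stated assumptions: arrivals are given as (packet id, arrival time) pairs, a single deadline applies to the whole frame, and the function name receive_frame is hypothetical. The actual decoding by the FEC codec is not modeled.

```python
def receive_frame(arrivals, k, deadline):
    """Decide whether a frame can be decoded.

    arrivals: list of (packet_id, arrival_time) for one frame.
    Returns (decodable, packets_for_decoder, timely_count), where
    timely_count is the value fed back to the sender.
    """
    # Keep only the packets that met the deadline, in arrival order.
    on_time = [pid for pid, t in sorted(arrivals, key=lambda a: a[1])
               if t <= deadline]
    timely_count = len(on_time)
    if timely_count < k:
        return False, [], timely_count   # frame is discarded
    return True, on_time[:k], timely_count  # any k-subset would do


arrivals = [(0, 1.0), (1, 5.0), (2, 2.0), (3, 2.5)]
print(receive_frame(arrivals, k=3, deadline=3.0))  # (True, [0, 2, 3], 3)
print(receive_frame(arrivals, k=4, deadline=3.0))  # (False, [], 3)
```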

B. Experimental Set-up

B.1 Network Configuration

Our experiments were carried out in the test environment
shown in Figure V.2. AFEC-MT, consisting of the AFEC-MT
sender and receiver, sent its traffic over a router where cross
traffic stemming from a separate cross traffic source was
multiplexed with the application traffic. By varying the cross traffic
characteristics as well as the resources (e.g., buffer capacity) at
the router, a wide spectrum of contention levels could be
produced. To facilitate a controlled environment for measurement
purposes, the camera and MPEG I encoder at the sender were
replaced by an emulator that fed the frames of stored MPEG I
video at real-time frame rates to the AFEC-MT sender. Thus
as far as AFEC-MT was concerned, its functioning was not
affected since, in either case, I, P, B frames were input to
the FEC encoder at real-time speeds (see footnote 2). The MPEG II
player at the receiver was always run, and the performance
measurements reflect the overhead incurred by the player and FEC
decoding processing overhead. Thus modulo where the I, P, B
frames came from at the sender-side, the system depicted in
Figure V.1 was faithfully implemented with all the components
implemented in software without specialized hardware support.



Fig. V.2. Experiment set-up for AFEC-MT performance measurements.
[Diagram: AFEC-MT sender, router, AFEC-MT receiver, cross traffic source.]

The topology depicted in Figure V.2 was realized over a
private FastEthernet LAN environment with the AFEC-MT
sender/receiver, cross traffic source, and router running on four
Sun UltraSparc 1 & 2 workstations. Without the configurable
router, we found that the 100 Mbps bandwidth of FastEthernet
was too large relative to the data rate of MPEG I video (~1.5
Mbps), even with our cross traffic source active, to cause
significant contention. With an UltraSparc 2 node acting as a
configurable router implementing FIFO packet scheduling, a wide
range of contention levels could be created under controlled
conditions. The AFEC-MT protocol as well as the cross traffic
source ran on top of UDP. AFEC-MT was realized as an
application layer process portable to other UNIX environments.

2 An implementation of the AFEC-MT sender which interfaces with an
Optibase real-time MPEG I compression engine is available for Windows NT.



B.2 Benchmark Traffic

Self-similar cross traffic was generated by utilizing traces
from [21]. We emphasize that the performance results are
obtained using actual MPEG I video rather than frame size
traces commonly used in simulation and experimentation
set-ups. Thus the cost of FEC encoding and decoding, control
actions, and MPEG II player's processing overhead are all
reflected in the performance results. We use the MPEG I video
clip, Beauty and the Beast, as the main benchmark data source.
The clip consists of 36000 frames which, at 15 f/s frame rate,
has a duration of 40 minutes. Other real-time video data
employed include Simpsons cartoon and Terminator clips.

C. Performance Measurement 

C.1 Unimodal Redundancy-Recovery Relation

Figure V.3 (top) shows the measured redundancy-recovery
function for the Beauty and the Beast MPEG I video clip
transmission when using static FEC with h fixed in the range 0-14.
We observe that the redundancy-recovery function is unimodal
with peak at h = 8. Figure V.3 (bottom) shows the corresponding
packet loss rate curve which monotonically increases with
h. The increase in packet loss rate stems from the higher traffic
rate associated with increased redundancy.

Fig. V.3. Top: Measured unimodal redundancy-recovery function for Beauty
and the Beast MPEG I video clip. Bottom: Corresponding packet loss rate
function.

Fig. V.4. Hit trace for static FEC (top), AFEC (middle), and AFEC-MT (bottom) for self-similar cross traffic active during the middle time interval [1000, 5000],
silent otherwise.



C.2 AFEC-MT vs. AFEC 

The dynamical performance of AFEC-MT vis-a-vis AFEC
and static FEC can be gleaned from Figure V.4. The figures
show impulse plots over time (represented as frame sequence
numbers) where the presence of a unit impulse indicates that
the corresponding frame was decoded timely at the receiver.
The absence of unit impulse (i.e., white stripe) shows frames
which did not meet their deadline. Figure V.4 (top) shows
performance of static FEC with h = 0 when a self-similar
cross traffic source was active during the middle time interval
[1000, 5000] while being silent otherwise. Interference of
cross traffic sharing a FIFO queue at the router degrades
performance of the application flow which manifests itself as
degraded QoS, i.e., hit rate, during the middle interval.
Figure V.4 (middle) shows corresponding performance of AFEC
for the same set-up and cross traffic conditions. We observe
that performance is significantly improved as shown by the
reduction in missed deadlines which translates to improved
end-to-end QoS as perceived by the user. Figure V.4 (bottom)
shows performance of AFEC-MT which further improves upon
AFEC's achieved QoS yielding a measured hit rate that is close
to the user's desired hit rate. We observe a small interval of
timeliness violation at time 1000 (the onset of cross traffic),
which is a nonstationary, unpredictable event to which
AFEC-MT then adjusts subsequently.

Whereas Figure V.4 depicts "instantaneous" QoS, Figure V.5
shows the running average of measured hit rate at the receiver
for a set-up involving 12000 frames during which the
self-similar cross traffic was always active. Figure V.5 (left) shows
mean hit rate for static FEC with h = 0 whose hit rate at 0.21
is significantly below the target hit rate of 0.92. The middle
figure shows corresponding performance of AFEC which
improves upon the performance of static FEC but is still
noticeably below the user-specified target level. Figure V.5 (right)
shows performance of AFEC-MT which converges to the target
hit rate after a transient initial adjustment period. The target hit
rate is reached much faster than indicated in the plot: the latter
shows the long-term running average of instantaneous hit rates
from time 0 onwards, not a local window.

We remark that the improvement of AFEC-MT over AFEC
stems from multiple time scale AFEC's ability to more
accurately discern short-term fluctuations from persistent changes
which allows it to be less reactive to short-range variations.
AFEC's advantage over static FEC lies in its ability to tailor
redundancy to current network state, increasing it when needed
to shield QoS, and decreasing redundancy when not needed to
reduce wastage of shared network resources. This trade-off
between QoS and bandwidth imparts a cost for reducing
redundancy in response to short-term changes in network state as, in
the near future, a QoS violation is likely to arise due to reduced
protection which will then subsequently trigger an increase in
redundancy. Bandwidth may have been temporarily "saved" by
decreasing redundancy in response to short-term changes, but
given the primary goal of achieving a target QoS while
minimizing bandwidth usage in so doing (a secondary objective),
responsiveness to short-term fluctuations is undesirable. Thus
AFEC-MT's ability to discriminate and undertake differentiated
action with respect to redundancy in the form of a short-range
component h_1 and long-range component h_2 endows it
with enhanced prowess to deliver end-to-end QoS while being
efficient with respect to its resource usage.

C.3 RTT and Proactivity 

As the round trip time (RTT) associated with the feedback
loop increases, the state information conveyed by feedback
becomes more outdated, and the effectiveness of reactive actions
undertaken by a feedback control diminishes. The penalty is
amplified in broadband wide area networks where the
delay-bandwidth product increases proportionally with delay or
bandwidth. Figure V.6 (left) shows measured hit rate as a function of
RTT for both AFEC-MT and AFEC. We observe that AFEC's
hit rate decreases significantly as RTT is increased which is
commensurate with the effect of outdatedness of feedback
information. AFEC-MT, on the other hand, suffers significantly
less under the same conditions maintaining a much flatter
performance curve. Thus AFEC-MT is able to mitigate part of the
cost incurred by reactive control by exploiting long-range
correlation structure to reduce the performance impact of outdated
feedback.

Fig. V.5. QoS performance comparison: static FEC (left), AFEC (middle), and AFEC-MT (right).

Figure V.6 (right) shows the relative performance gain of
AFEC-MT vis-a-vis AFEC where performance gain u is
defined as

$$u = \frac{\gamma_{\text{AFEC-MT}} - \gamma_{\text{AFEC}}}{\gamma_{\text{AFEC}}}.$$

Assuming γ_AFEC-MT > γ_AFEC, u > 0 represents the percentage
of improvement achieved by AFEC-MT over AFEC. We
observe that u increases with RTT indicating that AFEC's
susceptibility to long round-trip times increases vis-a-vis the
corresponding susceptibility of AFEC-MT.
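
As a quick check of this definition, the short Python sketch below applies it to two of the hit-rate pairs reported in Table I. The function name performance_gain is hypothetical; the formula and the input values are taken from the text.

```python
def performance_gain(hit_afec_mt, hit_afec):
    """Relative gain of AFEC-MT over AFEC: (g_MT - g_AFEC) / g_AFEC."""
    return (hit_afec_mt - hit_afec) / hit_afec


# Hit rates from Table I for the two most strongly correlated traffic cases.
print(f"{performance_gain(0.919, 0.764):.1%}")  # 20.3%
print(f"{performance_gain(0.918, 0.831):.2%}")  # 10.47%
```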

C.4 Impact of Long-range Dependence 

We have shown that the correlation structure in self-similar
traffic, upon effective utilization by AFEC-MT, leads to
enhanced QoS above and beyond what AFEC can provide. Yet
another dimension of interest is the impact of long-range
correlation structure, i.e., its strength, on performance gain.
Table I shows the hit rate of AFEC, AFEC-MT, and performance
gain when the long-range dependence present in cross traffic is
increased from weak (α = 1.95) to strong (α = 1.05).
Network traffic measurements correspond to traffic with α ≈ 1.
The α measure is related to the Hurst parameter for measuring
long-range dependence, and we refer the reader to [21], [30]
for a more detailed discussion.

Table I shows that performance gain amplifies as long-range
dependence is increased with α approaching 1. Thus at the
same time that long-range dependence exerts a negative
influence on performance from the queueing perspective, the same
structure can be exploited to effect control decisions that
mitigate the very performance effects that are caused by long-range
correlations in the first place. This is the "good news within the
bad news" syndrome [29]. We remark that when varying the
long-range dependence associated with cross traffic, it is
important that all other things are kept equal, including the average
traffic rate, to preserve comparability. Generating traffic loads
with this property is a nontrivial task due to sampling error
introduced when engaging heavy-tailed distributions in physical
traffic models. Details for generating normalized workloads are
provided in [30].

TABLE I

Impact of long-range dependence on performance gain of
AFEC-MT vs. AFEC.

α          1.05     1.35     1.65     1.95
AFEC       0.764    0.831    0.882    0.920
AFEC-MT    0.919    0.918    0.921    0.917
Gain       20.3%    10.47%   4.42%    -0.0%



Table I also shows that when traffic is weakly correlated
at large time scales (α = 1.95), the performance difference
between AFEC-MT and AFEC is minimal (0.917 vs. 0.920).
This is not surprising since given that there is little structure
at large time scales to exploit, the performance benefit
ensuing from coupling AFEC with the large time scale control
module should be commensurately low or nonexistent. At the
very least, AFEC-MT should not "hurt" performance vis-a-vis
AFEC which we find is the case.

D. Redundancy Schedule

Related to the impact of long-range dependence is the time
evolution of the redundancy schedule H(·) which for
predicted traffic level l at time scale T_2 assigns the base
redundancy h_L = H(l). The update of the components of H(i),
i = 1, 2, ..., 8, is effected by the symmetric control law given
in Section IV-B.2. Figure V.7 (top) shows the evolution of H(·)
as a function of time (represented as frame numbers here) for
α = 1.05 traffic. Figure V.7 (bottom) shows the corresponding
values for α = 1.95 traffic. We observe that when large time
scale correlation structure is weak, then the values are
concentrated around a narrow range (i.e., 3-5) which points toward the
fact that conditioning on predicted large time scale traffic level
is of limited utility. For α = 1.05 traffic, on the other hand,
the redundancy values are spread out over a much wider range
(i.e., 0-8) which indicates that conditioning does provide the
ability to discriminate with respect to the future. In particular,
this allows a base redundancy of h_L = 8 to be applied when
traffic contention is high (l = 8) to facilitate increased
protection, and a correspondingly small "DC" redundancy h_L = 0
when persistent traffic contention is low (l = 1).

Fig. V.6. Proactivity of AFEC-MT as a function of RTT. Left: Hit rate of AFEC-MT vs. hit rate of AFEC. Right: Performance gain of AFEC-MT relative to
AFEC.

Fig. V.7. Evolution of redundancy schedule as a function of time. Top: α =
1.05 traffic. Bottom: α = 1.95 traffic.



D.1 Multiple AFEC-MT Connections

AFEC-MT is an end-to-end protocol designed to run in
shared network environments where multiple AFEC-MT flows
compete for available resources. As seen in the control laws
for AFEC and its large time scale module that adjusts the
redundancy schedule (cf. Sections IV-B.2 and II-B), AFEC-MT
tries to achieve a target QoS, i.e., hit rate, while applying an
amount of redundancy deemed adequate to do so, but not more.
In this sense, it is a cooperative protocol, just like TCP, which
stands in contrast to noncooperative protocols that would
exploit other flows' cooperativeness and not back off even under
adverse conditions (see footnote 3), thereby absorbing their
bandwidth [7]. In fact, if the user-specified target hit rate is
sufficiently low, then AFEC-MT may end up consuming
less bandwidth than TCP, which tries to maximize throughput.
Of course, AFEC-MT can be transformed into a protocol which
seeks to maximize QoS, in which case its modus operandi is
analogous to that of TCP: back-off is instituted only when a
further increase in redundancy would result in a decreased hit
rate.

Table II shows hit rate for a network environment consisting
of three AFEC-MT connections. The three connections
compete for resources at the same bottleneck router shown in
Figure V.2 as for the single AFEC-MT connection case. The cross
traffic source remains the same. Table II shows bandwidth
sharing behavior for three network conditions: when the cross
traffic source is sending at a rate of 6.4 Mbps (network contention
is high), 4.8 Mbps, and 3.2 Mbps (network contention is low).
The target hit rate for each connection was set at 0.92. When
available bandwidth is small (first column), network resources
are insufficient to achieve the target hit rate 0.92. The flows
remain stable due to the back-off mechanism of AFEC-MT, and
all three connections end up sharing an approximately equal
amount of bandwidth which results in similar hit rates of 0.56,
0.53, and 0.57, respectively. Note that this is equivalent to a
"maximize QoS" mode. Bandwidth sharing behavior stays
invariant as the traffic rate of the cross traffic source is decreased,
eventually leading to all three connections achieving their
target hit rate 0.92 (the second connection is a fraction short) when
network contention is low (last column). Performance evaluation
with four or more connections was carried out using
simulations yielding qualitatively similar results.

3 That is, network state where increasing redundancy leads to a decrease in
hit rate for the flow in question.

TABLE II

QoS performance of multiple AFEC-MT connections with
respect to fairness.

Cross Traffic     6.4 Mb/s   4.8 Mb/s   3.2 Mb/s
Connection I      0.56       0.82       0.92
Connection II     0.53       0.83       0.91
Connection III    0.57       0.79       0.92



VI. Conclusion

In this paper, we have introduced a multiple time scale
extension of AFEC, a protocol that performs packet-level adaptive
FEC for real-time payload transport [22], [23], [26]. A large
time scale module that extracts long-range correlation
structure in network contention was constructed and coupled with
AFEC yielding its extension AFEC-MT. The large time scale
module augmented the feedback redundancy control
mechanism employed by AFEC by additively engaging a "DC"
redundancy level as a function of predicted, long-range network
state. We implemented AFEC-MT for real-time MPEG video
transport and showed that significant performance gains are
achievable by utilizing long-range correlation structure present
in self-similar traffic. An important consequence of AFEC-MT
is its ability to impart proactivity when subject to long
round-trip times. By exploiting long-range correlation structure at
time scales an order of magnitude higher than the RTT of the
feedback loop, the detrimental performance effect of outdated
feedback information is mitigated, which is especially severe
in broadband wide area networks that possess a high
delay-bandwidth product.

We present a new transport protocol called HPF for effectively
supporting heterogeneous packet flows in the Internet environment. The
following are the key features of HPF:

• HPF supports packet flows where different packets in the same
transport connection have different quality-of-service requirements in
terms of reliability, priority, and deadlines.

• HPF supports application-level framing, and provides APIs for
applications to specify the priority, reliability and timing requirements
of each frame.

• HPF enables the use of application-specified priorities as hints for
network routers to preferentially drop low-priority packets during
congestion. This ensures that 'important data' gets through preferentially
during congestion.

• HPF decouples the congestion control and reliability mechanisms in
order to support congestion control for unreliable and heterogeneous
packet flows.

Preliminary performance measurements in our experimental testbed
show that HPF can provide effective support for heterogeneous packet flows
in the presence of dynamic network resources.

I. Introduction 

The explosion in the use of the Internet in recent years has ex- 
posed several limitations in its design. Until recently, user traf- 
fic has been predominantly data-oriented, resulting in transport 
protocols that have focused on providing reliable data transport
and congestion control for flows in which all packets have the 
same requirements in terms of reliability, sequencing and time- 
liness. However, it is clear that the nature of user traffic in com- 
ing years will become increasingly multimedia oriented and het- 
erogeneous — for example, MPEG flows with multiple priority 
frames, multiplexed audio and video, multiple HTTP requests 
for text pages and images over the same transport connection,
etc. For such heterogeneous packet flows, different packets in
the same flow will have different quality of service requirements
in terms of priority, reliability, and timing. 

Current internet transport protocols do not support the con- 
cept of heterogeneous flows; for example, TCP guarantees reli- 
able and sequenced delivery for all packets in a flow, while UDP 
does not guarantee either delivery or sequencing for any packet.
Thus, applications that have heterogeneous flows are forced to
use multiple independent (homogeneous) flows and then
explicitly synchronize them above the transport layer [1], [5], [6];
this is both complex for applications, and less efficient for
adapting to the dynamics of the network.

Recently proposed alternatives to TCP and UDP include
user-level transport protocols (e.g. RTP [7] and XTP [8]) for
multimedia-oriented applications, and application level
framing (ALF) [2]. RTP provides congestion control, flow control,
and timing functionality in a user-level protocol on top of UDP,
while ALF moves much of the transport-level functionality into
the application in order to enable application-specific handling
of packet flows. We concur with the general principles
enunciated by both ALF and RTP that applications should have the
ability to specify the policies for framing, reliability, timing, and
priority at the granularity of application-specific frames for their
packet flows. We also believe that there should be a clean
separation between policies and mechanisms: applications should
specify the policies (at the application-specific frame level rather
than at the packet flow level), but the transport protocol should
provide the mechanisms to implement these policies. Enabling
applications to specify policies at a granularity that is finer than
a packet flow allows for heterogeneous packet flows; providing
the mechanisms of the transport layer in the kernel rather than
in the application (or a user-level protocol) allows for efficient
mechanisms and simpler applications.

In this paper, we describe a transport protocol called HPF for
effectively supporting Heterogeneous Packet Flows in an
Internet environment [3]. HPF has four key features: (a) support
for reliable and unreliable packets with different priority and
timing (delay) requirements in the same transport connection,
(b) support for application-level framing, and for applications to
specify the priority, reliability and timing requirements of each
frame, (c) use of application-specified priorities as hints for
network routers to preferentially drop low-priority packets during
congestion, and (d) decoupling of the congestion control and
reliability mechanisms in order to support congestion control for
unreliable and heterogeneous packet flows. In this paper, we
focus on those aspects of HPF that relate directly to the support of
heterogeneous packet flows, i.e. the first three features.

The rest of the paper is organized as follows. Section II briefly
discusses the goals of HPF. Section III describes the HPF
architecture and design. Section IV presents a performance evalua-
tion of HPF, and Section V concludes the paper. 

II. Motivation and Goals

Applications typically deal with data in terms of blocks or
'frames' which may have a different size than network-level
packets. For heterogeneous packet flows, different frames may
have different quality of service requirements in terms of
reliability, priority, and deadlines. At the same time, some functions
such as connection management and flow control are not
frame-specific attributes. Thus, the goal of HPF is to provide
mechanisms for connection establishment, flow control and congestion
control on a per-flow basis, and mechanisms for reliability,
sequencing, framing, prioritization, and timing requirements on a
per-frame basis, where the per-frame policy is specified by the
application. HPF seeks to provide a clean separation between
policies (controlled by the application), and mechanisms
(controlled by the transport layer in the kernel). In summary, HPF
seeks to combine the flexibility of heterogeneous packet flows
with the efficiency of homogeneous packet flows.

While HPF does not require any special mechanisms within
the network in order to operate, it passes down the
application-defined hints about packet/frame priorities to the network.
End-to-end congestion control mechanisms take on the order of a
few round trips to take effect. Thus, while end-to-end
congestion control can be used as a measure to adapt to longer term
network variations, the network typically reacts to short-term
(instantaneous) network dynamics by dropping packets during
congestion. For heterogeneous packet flows, the network can
improve the end-to-end perception of network quality by
preferentially dropping low priority packets instead of higher
priority packets [4]. Note that priority-based packet dropping will
not change the number of packets dropped, but it can reduce or
eliminate the number of high priority packets dropped.
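
The last point can be illustrated with a small Python sketch that compares dropping the tail of an overflowing queue against dropping the lowest-priority packets first. It is not a description of any particular router: the function names, the (sequence number, priority) tuple layout, and the tie-breaking rule are assumptions of the example. Priority value 0 is treated as the highest priority, matching the header encoding HPF uses.

```python
def drop_tail(packets, capacity):
    """Keep the first `capacity` packets; drop the rest regardless of priority."""
    return packets[:capacity], packets[capacity:]


def drop_by_priority(packets, capacity):
    """Drop the same number of packets, choosing the lowest-priority ones.

    packets: list of (seq, priority) in arrival order; priority 0 is highest.
    Ties are broken by dropping later arrivals first (an assumption).
    """
    excess = max(0, len(packets) - capacity)
    order = sorted(range(len(packets)),
                   key=lambda i: (-packets[i][1], -i))
    victims = set(order[:excess])
    kept = [p for i, p in enumerate(packets) if i not in victims]
    dropped = [p for i, p in enumerate(packets) if i in victims]
    return kept, dropped


queue = [(0, 3), (1, 0), (2, 2), (3, 0), (4, 1)]
print(drop_tail(queue, 3)[1])         # [(3, 0), (4, 1)]: a priority-0 packet is lost
print(drop_by_priority(queue, 3)[1])  # [(0, 3), (2, 2)]: same count, low priority only
```

Both policies drop two packets here; only the choice of victims differs.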

In addition to supporting heterogeneity in frame/packet level
requirements, HPF must also provide effective congestion
control mechanisms at the transport layer. In TCP, congestion
control is closely coupled with reliability: cumulative
acknowledgements serve as feedback for both purposes. However, since HPF
supports both unreliable and reliable packets in a single flow,
congestion control is decoupled from reliability in HPF. While
the details of the congestion control algorithm are beyond the
scope of this paper, we refer the reader to [3].

III. HPF Architecture and design 

HPF is a connection-oriented transport protocol. It allows
applications to specify blocks of data called 'frames', and to pro-
vide frame-specific policies for reliability, priority and timing 
(we call these the policy parameters). A frame is then treated as 
a single entity, i.e. all the packets belonging to the same frame 
will have the same policy parameters, and a frame may be read 
or written as a single unit by the application. While future de- 
signs of HPF will include an option for out-of-order delivery 
of frames to applications, the current design of HPF guarantees 
in-sequence delivery (though unreliable frames/packets may be
lost). HPF provides flexible policies for applications to read and 
write data in terms of either data streams or frames.

Figure 1 shows the overview of the HPF architecture. HPF is
composed of three logical sub-layers. 

• Application framing (AF) sub-layer. The AF sub-layer is
responsible for converting frames into packets at the sender and
packets into frames at the receiver, prioritization of frames and
packets, and providing the application interface to read and write
frames and packets with frame-specific policies.

• Windowing, reliability, timing and flow-control (WRTF) sub- 
layer. The WRTF sub-layer is responsible for coordinating 
the window advancement between the sender and the receiver, 
flow control, reliably transmitting loss sensitive packets, delet- 
ing deadline-based packets which cannot meet their deadlines, 
and for providing sequencing for the heterogeneous packet flow. 

• Congestion control (CC) sub-layer. The CC sub-layer is
responsible for congestion control, and for the estimation of the
rate and round-trip time parameters for a connection.

The three sub-layers perform distinct tasks, but each sub-layer
uses information from the other two sub-layers. We focus on the
first two sub-layers in this paper.

A. Application Framing Sub-Layer

We describe the overview of the AF sub-layer below, and
revisit the application interface in Section III-C.

At the sender, the AF sub-layer provides the applications with
the ability to write frames with different policies for reliability,
priority, and timing. Briefly, the AF layer takes a frame
written by the application and breaks it up into packets of size MSS
bytes (where MSS, the maximum segment size, is negotiated
during connection establishment as in TCP). Each packet has
a sequence number (which is the byte sequence number of the
start of the packet, similar to TCP), and the following
parameters:

• 1 bit reliability field: if set, the packet is reliable, otherwise
the packet is unreliable.

• n bit priority field: 0 to 2^n - 1 in descending order. Since HPF
guarantees sequencing, reliable packets have a priority level of
0 (highest). (Our implementation currently uses n = 2).

• 16 bit delay field: the delay in milliseconds that the packet can
tolerate. A delay of 0 indicates a packet with no delay bound. 
There are three types of packets in HPF: (a) reliable packets, 
which do not have a delay bound but have the reliability field 
set, (b) unreliable delay-bounded packets, which have a non- 
zero delay bound, and (c) unreliable best-effort packets, which 
have the delay field set to 0, and have the reliability field reset. 

• 1 bit frame field: if set, indicates the end of a frame. HPF 
allows applications to read and write data as frames or streams. 
In stream mode, each packet has the frame field set. In frame 
mode, only the last packet of each frame has this field set. 

All packets of a frame contain the same policy parameters,
which are specified by the application policy for the frame. The
AF sub-layer sets the frame bit of the last packet of a frame, and
adds an optional field to the header of this packet containing the
sequence number of the start packet of the frame. Thus, upon
reception of the last packet of a frame, the receiver can begin
reconstructing the frame. Once the AF sub-layer creates a
sequence of packets for a frame, it passes this packet sequence to
the WRTF sub-layer. Note that there is no copying of data
involved in passing packets from one sub-layer to another, since
all the sub-layers use the same send/receive buffer for the
connection.
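
The packetization step described above can be sketched roughly as follows. This is only an illustrative model, not the HPF kernel code: the `Packet` class, the `packetize` and `packet_type` helpers, and all field names are hypothetical, and forcing reliable packets to priority 0 with no delay bound is our reading of the field descriptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    seq: int                 # byte sequence number of the first payload byte
    payload: bytes
    reliable: bool           # 1 bit reliability field
    priority: int            # n bit priority field, 0 is highest
    delay_ms: int            # 16 bit delay field, 0 means no delay bound
    end_of_frame: bool       # 1 bit frame field
    frame_start: Optional[int] = None  # optional header field on the last packet


def packetize(frame: bytes, next_seq: int, mss: int, reliable: bool,
              priority: int, delay_ms: int) -> List[Packet]:
    """Split one frame into MSS-sized packets that share the same policy."""
    if reliable:
        # Assumption: reliable packets carry priority 0 and no delay bound.
        priority, delay_ms = 0, 0
    if not 0 <= delay_ms < 2 ** 16:
        raise ValueError("delay must fit in 16 bits")
    packets = []
    for off in range(0, len(frame), mss):
        last = off + mss >= len(frame)
        packets.append(Packet(
            seq=next_seq + off,
            payload=frame[off:off + mss],
            reliable=reliable,
            priority=priority,
            delay_ms=delay_ms,
            end_of_frame=last,
            frame_start=next_seq if last else None,
        ))
    return packets


def packet_type(p: Packet) -> str:
    """Classify a packet into one of the three HPF packet types."""
    if p.reliable:
        return "reliable"
    return "unreliable delay-bounded" if p.delay_ms > 0 else "unreliable best-effort"
```

For example, a 2500-byte frame written with an MSS of 1000 yields three packets, and only the third has the frame bit set and carries the starting sequence number of the frame.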

B. Windowing, Reliability, Timing, and Flow Control Sub-Layer

The WRTF sub-layer is the core of HPF because it provides
the crucial functions of managing packets with heterogeneous
requirements, and flow control. Specifically, the WRTF
sub-layer performs the following tasks.

• Sequencing: The WRTF sub-layer guarantees that packets are
delivered in sequence to the application.

• Deadline tagging: The WRTF sub-layer converts the relative
deadline for a packet (from the AF layer) to an absolute
deadline, and associates this deadline with the packet.

• Retransmissions: The WRTF sub-layer handles retransmissions
for lost packets depending on the packet type. A reliable
packet is retransmitted until it has been acknowledged. An
unreliable delay-bounded packet is retransmitted until it has either
been acknowledged or its deadline has expired. An unreliable
best-effort packet is never retransmitted.

• Flow control: As in TCP, the WRTF sub-layer uses
receiver-initiated 'cumulative' end-to-end acknowledgements in order to
advance its sender window. However, unlike TCP, HPF has
interleaved reliable, unreliable delay-bounded, and unreliable
best-effort packets. Thus, the semantics of cumulative
acknowledgements are different in HPF. We describe the HPF flow
control algorithm in more detail below.

• Connection establishment and teardown: Connection
establishment and teardown in HPF are identical to TCP. The
SYN, SYN+ACK, ACK, FIN, and FIN+ACK control packets are
marked high priority.

There are two key aspects of the WRTF sub-layer:
retransmissions, and flow control. We describe each below.

B.1 Retransmissions

The retransmission policy for reliable and unreliable best ef- 
fort packets is simple: a reliable packet is retransmitted till it 
is acknowledged, while an unreliable best effort packet is never 
retransmitted. 

The retransmission policy for an unreliable delay-bounded
packet is more complex. While the WRTF sub-layer estimates
that the packet can be retransmitted and acknowledged before
its deadline, the packet is retained in the retransmission queue.
At any time, the WRTF sub-layer has the current estimates for the
transmission rate, round-trip time, and retransmit timeout values
from the CC sub-layer. At some time t, consider an unreliable
delay-bounded packet p of size b bytes with deadline d,
transmission rate of r bytes/ms, and round-trip time of T ms. Thus,
the slack s for the packet p is s = d - t in ms. The WRTF
sub-layer will keep p in its retransmit queue if b/r + T < s, i.e.
WRTF estimates that p can be retransmitted and acknowledged
before its deadline. Once WRTF determines that p cannot meet
its deadline, it discards p from its retransmit queue.
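
As a small worked illustration of this test, the check can be written as a single predicate. The function and parameter names below are hypothetical; the comparison is our reading of the condition in the text, namely that the transmission time b/r plus one round-trip time T must be smaller than the slack d - t.

```python
def keep_in_retransmit_queue(now_ms: float, deadline_ms: float,
                             size_bytes: float, rate_bytes_per_ms: float,
                             rtt_ms: float) -> bool:
    """Return True while the packet can still plausibly meet its deadline."""
    slack_ms = deadline_ms - now_ms                   # s = d - t
    needed_ms = size_bytes / rate_bytes_per_ms + rtt_ms  # b/r + T
    return needed_ms < slack_ms
```

With b = 1000 bytes, r = 100 bytes/ms and T = 50 ms, a retransmission needs an estimated 60 ms, so the packet is kept while more than 60 ms of slack remain and is discarded after that.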

Note that since r and T are estimates, HPF does not provide
delay guarantees for packets. Of course, HPF cannot enforce a
delay bound since it is purely an end-to-end protocol. However,
the CC sub-layer does provide fairly accurate estimates of r and
T when the dynamics of the network are not severe.

B.2 Flow Control and Window Advancement 

Since HPF guarantees in-sequence delivery of packets, the
window advancement algorithms at the sender and receiver are
closely related to the retransmission policy described above.

Figure 2 shows the flow control and window advancement
parameters in HPF. At the receiver side, the packets between
read_nxt and rcv_nxt can be passed up to the AF sub-layer
(possibly with holes due to lost unreliable packets). The
rcv_nxt packet has not been received, and is either a reliable
packet or an unreliable delay-bounded packet whose deadline
has not expired. At the sender side, all packets up to snd_una
have been acknowledged and may be discarded. The key issue
in flow control and window advancement is to determine how
rcv_nxt and snd_una are advanced at the receiver and sender
respectively.

For simplicity of presentation, we consider packets to be se- 
quenced in increments of 1 rather than following byte sequenc- 
ing. We describe the window advancement algorithm for three 
cases, in increasing degree of complexity. 

Case 1: All packets are reliable: In this case, window
advancement in HPF is identical to TCP. When a receiver receives a
packet p_s with sequence number s, it can set rcv_nxt to s + 1
if the current value of rcv_nxt is s. Likewise, if the current
value of rcv_nxt is s and the receiver has already received the
packet p_s, then the receiver can advance rcv_nxt to s + 1.

Case 2: Packets are either reliable or unreliable best-effort: For
this case, the semantics of rcv_nxt are the following: every
reliable packet with a sequence number less than rcv_nxt has
been received. Consider a packet p_{s,h}, where s denotes the
sequence number of the packet and h denotes the sequence
number of the last reliable packet preceding p_{s,h}. The receiver can
set rcv_nxt to s + 1 if the current value of rcv_nxt is greater
than h. We say that p_{s,h} depends on the packet with sequence
number h. Likewise, if the current value of rcv_nxt is h, then
the receiver can update the value of rcv_nxt to the highest
sequence number s, such that (a) the packet p_{s,h} has been
received, and (b) the transitive closure of the packets on which
p_{s,h} 'depends' has been received.

In essence, the receiver can advance its receiver window be- 
yond the last packet p it has received such that every high prior- 
ity packet preceding p has been received. The loss of a reliable 
packet will stall the progress of the receiver window. On the 
other hand, the loss of an unreliable best-effort packet will not 
stall the progress of the receiver window, though it will cause a 
hole in the sequence of packets sent to the AF sub-layer at the 
receiver. Delayed out-of-sequence unreliable best-effort packets
are discarded, because the receiver window would have already
advanced beyond the sequence numbers of such packets when
they are received.
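
The Case 2 rule can be sketched as a small receiver model. This is a simplified illustration under our own assumptions, not the HPF implementation: packets are numbered in increments of 1, each packet is identified only by its pair (s, h), h = -1 is used for packets with no preceding reliable packet, and the class and method names are hypothetical.

```python
class Case2Receiver:
    """Receiver window advancement for reliable plus best-effort packets."""

    def __init__(self, rcv_nxt: int = 0) -> None:
        self.rcv_nxt = rcv_nxt
        # s -> h for packets that arrived ahead of a missing reliable packet
        self.pending = {}

    def receive(self, s: int, h: int) -> bool:
        """Process packet p_{s,h}; return False if it is late and discarded."""
        if s < self.rcv_nxt:
            return False  # window already moved past this packet
        self.pending[s] = h
        progressed = True
        while progressed:
            progressed = False
            for seq in sorted(self.pending):
                # A packet may advance the window once rcv_nxt > h.
                if self.pending[seq] < self.rcv_nxt:
                    self.rcv_nxt = max(self.rcv_nxt, seq + 1)
                    del self.pending[seq]
                    progressed = True
        return True
```

In this model, losing a best-effort packet simply leaves a hole, while losing a reliable packet holds rcv_nxt at that packet until its retransmission arrives, at which point every buffered packet that depended on it is released.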

Case 3: Packets are reliable, unreliable delay-bound, or
unreliable best-effort: This case is more complicated than the ones
before because delay-bound packets may be retransmitted, but
cannot be treated like reliable packets since they may be
discarded upon violation of their delay bounds. Thus, the loss of
an unreliable delay-bound packet may or may not require the
receiver window to advance, depending on whether the
deadline of the packet has expired or not. However, the receiver has
no notion of when the deadline of a delay-bound packet has
expired. Thus, the sender must treat the unreliable delay-bound
packet like a reliable packet while its deadline has not expired
and like an unreliable best-effort packet after its deadline has
expired, and provide notifications to the receiver on how far it
can advance its receiver window.

Consider a packet p_{s,h,w}, where s denotes the sequence
number of the packet, h denotes the sequence number of the last
reliable packet preceding p_{s,h,w}, and w denotes a sequence number
specified by the sender. When the receiver receives the packet
p_{s,h,w}, it will set rcv_nxt to max(rcv_nxt, w) if the current
value of rcv_nxt is greater than h. Likewise, if the current
value of rcv_nxt is h, then the receiver can update the value of
rcv_nxt to the highest sequence number w, such that (a) the
packet p_{s,h,w} has been received, and (b) the transitive closure of
the packets on which p_{s,h,w} 'depends' has been received.



Fig. 2. Illustration of the window advancement algorithm in HPF.

In essence, the sender uses the value w in each packet to
control window advancement at the receiver. For an unreliable
delay-bound packet p with sequence number s, while it has not
been discarded from the retransmit queue, the w field of each
outgoing packet will be upper bounded by s. Once p is
discarded from the retransmit queue due to delay violation (or if
it has been acknowledged), the w value in subsequent packets
transmitted from the sender can be greater than s. The receiver
sets its receiver window to the maximum w among all those
received packets p, such that the transitive closure of packets on
which p depends has been received. Figure 2 illustrates a simple
example of this algorithm.

Note that Case 1 is a special case of Case 3 when a packet
with the sequence number of s can be denoted by p_{s,s-1,s+1},
and Case 2 is a special case of Case 3 when a packet with the
sequence number of s that depends on a packet with the sequence
number of h can be denoted by p_{s,h,s+1}.
In the WRTF sub-layer, we implement the approach described
in Case 3 for window advancement at the receiver, and the
retransmit algorithm in Section III-B.1 at the sender.

Figure 3 shows a typical snapshot of the state at the WRTF
layer. Each packet p_{s,h,w} contains three sequence numbers:
p.s, p.h, and p.w. Determining p.h for a packet is simple
- the sender maintains the sequence number h of the last
reliable packet that has been queued for transmission, and every
subsequent packet that is queued for transmission has p.h =
h. The sender maintains three queues of packets: a send queue
sque which queues the packets written to by the sender but not
yet transmitted, a retransmit queue rque which queues packets
that can be retransmitted, and a deadline queue dque which is a
queue of pointers to the unreliable delay-bounded packets in the
retransmit queue. All queues are ordered by sequence number.
The p.w value of an outgoing packet is the sequence
number of the head of line packet in dque. The retransmit
policy governs when a delay bounded packet is deleted from the
send/retransmit queue.
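
The receiver side of Case 3 can be sketched in the same style. This is only an illustrative model that follows the prose rule literally. The class and method names are hypothetical, h = -1 stands for "no preceding reliable packet", and the w values are supplied by the caller instead of being derived from a modelled dque, since the full sender logic belongs to the pseudo-code of Figure 4.

```python
class Case3Receiver:
    """Receiver window advancement driven by the sender-specified w field."""

    def __init__(self, rcv_nxt: int = 0) -> None:
        self.rcv_nxt = rcv_nxt
        self.received = set()   # sequence numbers seen so far
        self.pending = []       # (h, w) pairs waiting for packet h

    def receive(self, s: int, h: int, w: int) -> int:
        """Process packet p_{s,h,w} and return the updated rcv_nxt."""
        self.received.add(s)
        self.pending.append((h, w))
        progressed = True
        while progressed:
            progressed = False
            for item in list(self.pending):
                dep, limit = item
                if dep < self.rcv_nxt:       # rcv_nxt is greater than h
                    self.pending.remove(item)
                    if limit > self.rcv_nxt:  # rcv_nxt = max(rcv_nxt, w)
                        self.rcv_nxt = limit
                    progressed = True
        return self.rcv_nxt
```

Feeding this model packets of the form (s, s-1, s+1) reproduces the Case 1 behaviour. If the sender instead keeps w pinned at the sequence number of an unacknowledged delay-bounded packet, the window stays at that packet until a later packet arrives carrying a larger w.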

Once the receiver computes the rcv_nxt value, it can
compute the advertised receiver window size from Figure
3. A WRTF-acknowledgement from the receiver consists
of two parts: the cumulative acknowledgement (cack), and
the advertised window (adv). The receiver sets cack to
rcv_nxt, and adv to rcv_adv - rcv_last in its
WRTF-acknowledgements. When the sender receives a
WRTF-acknowledgement from the receiver, it sets snd_una to cack,
and sets write_max to snd_nxt + max(adv, cwnd). The
congestion window, cwnd, is computed by the CC sub-layer
and passed on to the WRTF sub-layer.

Figure 4 provides an outline of the pseudo-code for the win- 
dow management and flow control algorithms in HPF. 

Fig. 4. Pseudo-code for key pieces of the Sender and Receiver Flow Control
Algorithm.

Support for delay-bounded packets in HPF does not require
the receiver to know which packets are delay bounded, or
maintain any kind of clock at the receiver. As a result of our
minimalistic approach, the timing semantics for delay-bounded packets
in HPF are 'mostly within deadline' delivery rather than
'guaranteed within deadline' delivery for those packets that are
received at the receiver. This is an acceptable model for most
applications that need real-time support in the Internet. Thus,
HPF can support heterogeneous flows with interleaved reliable,
unreliable delay-bounded, and unreliable best-effort packets, in
which the applications specify the policies but the mechanisms
are provided in the kernel. Figure 2 shows an example of how
this approach works.

C. Application Interface

We now revisit the AF sub-layer for the purposes of
describing the application interface for HPF. As mentioned before, a
unique feature of HPF is that it allows applications to specify
per-frame policies for reliability, priority, and timing. HPF
provides a socket interface very similar to the stream socket
interface for TCP. Each application creates and binds sockets with
the socket() and bind() calls respectively. For HPF, we
specify a new socket type called SOCK_HPF. The server then
makes the listen() and accept() calls, while the client
makes the connect() call. Once an HPF connection is
established, HPF provides special mechanisms for application-level
framing, and the specification of per-frame reliability, priority,
and timing requirements, as described below.

An application may read from or write to a HPF connection
in two modes: stream mode and frame mode. In stream mode,
data is transmitted over the network as a continuous stream of
bytes (with holes in between where unreliable packets are lost).
In frame mode, data is transmitted over the network as a
sequence of frames (with holes in between where parts of
unreliable frames are lost - the loss of any part of a frame causes the
loss of the entire frame). In the same flow, an application may
interleave the two different modes when reading or writing.
Besides, the writer and reader need not be in the same mode: in
particular, the writer may be in frame mode while the reader
may be in stream mode. Figure 5 shows the difference between
reading and writing in the two modes.

Fig. 5. Read/Write in different modes. This figure shows the different
responses that applications will see depending on the read and write modes.
The shaded box indicates a lost packet. An "X" in a receive buffer indicates
invalid data.

Writing to a HPF connection: An application writes to a
HPF connection using the HPFwrite() socket call, similar
to the write() socket call, but with additional fields. The
HPFwrite() call has the following policy parameters:

• socket descriptor, buffer, length: These are the standard
write parameters.

• mode: The mode field can be either STREAM or FRAME. In
stream mode, each packet is considered to be an independent
unit, while in frame mode, all the packets belonging to the same
frame are considered to be a single unit.

• reliability field: The reliability field may be either RELIABLE
or UNRELIABLE.

• priority field: The priority field can be 0, 1, 2, or 3, in
descending order of priority.

• frame field: In stream mode, the frame field is irrelevant since
the AF sub-layer will set the frame field to 1 for each packet. In
frame mode, the application will set the frame field to 1 to
denote the end of a frame, and 0 otherwise. Thus, multiple writes
may constitute a single frame.

• delay field: The delay field is a 16 bit unsigned short
integer denoting the delay bound (relative to the time of writing) in
milliseconds.

In frame mode, the AF sub-layer will buffer the entire frame
before packetizing it and forwarding it to the WRTF sub-layer.
For a frame that is written using multiple HPFwrite() calls,
only the frame policy specified in the last write is significant,
i.e. all packets in the frame will contain the same options
specified in the last write for the frame.

Reading from a HPF connection: An application reads from a
HPF connection using the HPFread() socket call, similar to
the read() socket call, but with five fields: socket descriptor,
buffer, length, mode, and status. The status field is a pointer to
an unsigned character, and is a return parameter.

If the mode field is FRAME, then the HPFread() call fills
the buffer with the next completely received frame, and returns
the size of the buffer read. If the buffer size is less than the frame
size, the remaining contents of the frame are lost, and an error
notification is set through the status field. Note that any partially
received frames are discarded at the receiver.

If the mode field is STREAM, then the HPFread() call
returns a maximal sequence of either valid data or invalid data,
bounded in size by the 'length' parameter of the HPFread()
call. By maximal sequence, we mean that either all the data
returned by a read call is valid, or all of it is invalid (in case
of lost packets in the packet flow). The return value of the
HPFread() call specifies the size of the read, and the status
field indicates whether the buffer contains valid data or invalid
data.

If the writer is in stream mode, then each individual packet
is treated like a separate frame and it is immaterial whether the
reader is in stream or frame mode. However, if the writer is in
frame mode, the reader will experience different responses from
the connection depending on whether it is in stream mode or
frame mode. If the reader is in stream mode, it can read parts of
a frame even if some other parts of the frame were lost.
However, if the reader is in frame mode, it can read a frame only if
the entire frame was correctly received. Note that the reader
can migrate between the stream mode and the frame mode.
Figure 5 shows an example of different read and write modes.
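
The difference between the two read modes can be illustrated with a toy model of the receive buffer. This sketch works at packet rather than byte granularity, represents a lost packet as None, and uses hypothetical helper names; it is not the HPFread() implementation and it ignores the buffer-too-small error case.

```python
from typing import List, Optional, Tuple

Payload = Optional[bytes]  # None marks a lost packet


def stream_read(packets: List[Payload],
                max_packets: int) -> Tuple[List[Payload], bool, List[Payload]]:
    """Return (run, valid, rest): a maximal all-valid or all-invalid run."""
    if not packets or max_packets <= 0:
        return [], True, packets
    valid = packets[0] is not None
    n = 0
    while (n < len(packets) and n < max_packets
           and (packets[n] is not None) == valid):
        n += 1
    return packets[:n], valid, packets[n:]


def frame_read(frames: List[List[Payload]]) -> Tuple[Optional[bytes], List[List[Payload]]]:
    """Return (data, rest): the next completely received frame, if any."""
    for i, frame in enumerate(frames):
        if all(p is not None for p in frame):
            return b"".join(frame), frames[i + 1:]
        # A frame with any lost packet is discarded as a whole.
    return None, []
```

A stream-mode reader sees the surviving packets of a damaged frame followed by a run flagged as invalid, whereas a frame-mode reader skips the damaged frame entirely and receives only the next intact one.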

Retrieving Connection State: In order to support application
adaptation, we provide the ability for the application to
retrieve the current rate and expected latency of the
connection via getsockopt() calls. The latency value returned
is half the estimated RTT, while the rate value returned is
max{ssthresh, cwnd/RTT}. These parameters are
designed to serve as coarse estimates for the application to adapt
in the long term.



D. TCP backward compatibility

HPF uses a 3-way handshake protocol for connection
establishment identical to TCP. In order to preserve backward
compatibility with TCP for homogeneous reliable flows, the
client and the server negotiate the protocol by means of the
following mechanism during connection establishment: HPF
includes special options after the TCP header in the SYN and
SYN+ACK packets. An application can enable HPF options via a
setsockopt() call prior to the connect() or accept()
calls. If both end-points include the HPF option fields during
connection setup, then an HPF connection is established.
Otherwise a standard TCP connection is established.

IV. Performance Measurements with HPF

We present a preliminary performance evaluation of HPF. In
particular, we are interested in comparing HPF with TCP and
UDP under congestion. Performance comparison to RTP is
ongoing work.

In this section, we present four tests. First, we compare TCP
with HPF with random packet drops in the network ranging from
5% to 20%. Second, we compare the goodput of TCP and HPF
for different ratios of high and low priority packets in a
heterogeneous packet flow. Third, we run tests over the Internet
to measure the impact of HPF over long-haul connections. In the
absence of network support for priority-based packet dropping,
priorities for unreliable packets have no impact on the
performance of HPF. Tests 1 - 3 were performed without assuming
any special network support, and thus we only used two
levels of priority - high for reliable packets and low for
unreliable packets. For test 4, we instrumented the router to support
priority-based packet dropping in order to demonstrate the
coordinated adaptation of HPF and the network for an MPEG flow
with multi-priority frames. We have not considered unreliable
delay-bounded packets in the performance evaluation section of
this paper.

The testbed environment is shown in Figure 6. The dedicated
network consists of a star topology with four computers all
running Linux 2.0.33, and connected via point-to-point 10 Mbps
Ethernet links. Note that we have throttled the link between
hosts atri and durga to approximately 1 Mbps by introducing a
10 ms delay between successive packet transmissions in order to
allow us to create network congestion by overloading this
bottleneck link with packets.

Fig. 6. The experimental testbed configuration used for the performance tests.
The link between hosts durga and atri is throttled in order to create
congestion, reduce the apparent bandwidth of the link, and increase the delay for
packets traversing through the link.



A. HPF versus TCP with different random packet drop
percentages

In the first test, we compare the performance of HPF and TCP
with different percentages of random packet dropping. Table I
shows the total time in seconds spent to receive 2 MB of data for
both TCP and HPF. Data traffic goes from radha to atri without
congestion in this test. 12.5% of packets are of high priority.
We modify the kernel on radha such that it randomly selects
0%, 5%, 10% and 20% of packets to drop in four different tests.
From the 0% loss experiment, we notice that HPF introduces
an overhead of 0.004 seconds for a 2 MB file transfer. As the
percentage of packets dropped starts to rise, HPF performs faster
than TCP. Since dropping a low priority (unreliable) packet does
not trigger a retransmit, the total time needed to complete the file
transfer with 5% loss is smaller than in the no loss case. In fact,
as long as the dropping probability is less than the percentage
of high priority packets, the cost of retransmitting high priority
packets is compensated by the smaller number of low priority
packets received. Note that our comparisons of HPF and TCP in
this experiment are inherently unfair with packet drops because
HPF is transferring fewer bytes than TCP, which is reliable. The
purpose of this example is not to show that HPF is faster than
TCP, but that the rate of data transfer of the high priority
(reliable) packets degrades more gracefully for HPF than TCP under
packet loss. Thus, for applications that can tolerate loss, HPF
degrades more gracefully with increase in packet loss.

TABLE I

The speedup of HPF versus TCP with different percentages of
random drop.

Percent    TCP       HPF       HPF bytes   TCP/HPF
           (sec)     (sec)     dropped     speedup

0%         13.806    13.810    0%          -
5%         24.50     13.37     4.5%        1.84
10%        53.26     13.73     8.4%        3.56
20%        149.91    17.33     16.3%       5.65


B. HPF versus TCP with different priority ratios

For the second test, three different 2 MB data flows were
transmitted from radha to atri using various ratios of high priority
(reliable) to low priority (unreliable) packets. During the
transmission, three unregulated flows offered congestion in the form
of multiple bursts of UDP packets at 60 millisecond intervals
from tester to atri. The length of time required to complete the
transmission of the 2 MB data flows was then recorded over
multiple runs. Table II shows the results, averaged over the three
data flows. The different combinations of the High:Low ratio
used apply to all of the data flows. Note that as the fraction
of low priority packets increases, HPF performs faster than TCP
at the expense of low priority packet loss. Specifically, for the
High:Low ratio of 5:95, we observed a 41% reduction in the
overall transmission time of the data flows at the expense of
a 23% loss in low priority packets and no loss in high
priority packets (which are retransmitted till successfully
acknowledged).

As in the previous example, our intention is not to perform
a direct head-to-head comparison of HPF against TCP because
the two protocols are not transmitting the same amount of data.
The goal is to show how applications perceive the progress of
flows with different High:Low packet ratios in scenarios with
high burst losses.

C. HPF versus TCP over the Internet

For the third test, we created a one way tunnel from UIUC
to Boston with 15 hops. We send HPF flows with different
High:Low ratios ranging from 100% high priority packets
(TCP-like) to 100% low priority packets (UDP-like). Table III shows
the results. While UDP (not shown in table) performs very badly
in this case, losing as many as 75% of the packets due to
unregulated transmission, HPF provides a very attractive trade-off
between the number of packets lost and transmission rate. In
particular, even with 100% low priority packets, HPF lost only
about 8% of the packets but improved the apparent throughput
of the connection by about 75% over TCP. Of course, HPF
performs significantly better than UDP since it performs
congestion control. The congestion control algorithm of HPF is also
the reason for its superior performance over TCP even for the
scenarios with small fractions of packet loss. This test shows
that even over the Internet, without any link-layer support for
priority dropping, HPF can significantly improve the apparent
quality of a connection for loss tolerant flows.

TABLE II

The performance of HPF versus TCP for various High:Low
priority ratios with multiple concurrent streams.

Protocol   High:Low   Packets     Time improvement
           Ratio      Dropped     vs TCP

TCP        all high   0%          -
HPF        50:50      4.9710%     9.9421%
           33:66      7.625%      16.8454%
           20:80      12.5544%    17.79%
           10:90      16.9267%    34.756%
           5:95       23.7603%    41.5078%

TABLE III

The performance of HPF versus TCP for various High:Low
priority ratios over an IP tunnel.

Protocol   High:Low   Packets     Improvement
           Ratio      Dropped     vs TCP (time)

TCP        all high   0%          -
HPF        66:33      0.71%       31.77%
           50:50      2.86%       51.60%
           33:66      4.29%       64.76%
           10:90      5.71%       69.0%
           5:95       7.86%       69.93%
           0:100      7.86%       75.07%



D. Impact of priority-based packet dropping in the network on 
HPF 

For the fourth test, we added network support in the form of a
'HPF priority-aware' Round Robin (RR) packet scheduler that
provides a separate FIFO queue for each flow, and priority-based
packet dropping upon buffer overflow. For this test, we ran the
modified scheduler at durga, and we created two HPF flows
from radha to atri and two from tester to atri. The High:Low
ratio was set to 5:5 for all flows. The result was that the time
improvement over TCP for this case was 57.9% as opposed to
29.7% without the modified scheduler. Note that the number of
packets dropped at durga did not change; however, lower
priority packets were dropped rather than higher priority packets,
thus improving both the perception of the connection quality and
the throughput (due to fewer retransmissions).
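
The per-flow queue with priority-based dropping can be sketched as below. This is a hypothetical model and not the scheduler run at durga: the class name, the capacity expressed in packets, and the tie-breaking rule (drop the newest packet among those with the lowest priority, which degenerates to tail drop when all priorities are equal) are our assumptions.

```python
from collections import deque
from typing import Hashable, Optional


class PriorityDropQueue:
    """FIFO queue that, on overflow, drops the lowest-priority packet."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.q = deque()  # entries are (packet_id, priority), 0 is highest

    def enqueue(self, packet_id: Hashable, priority: int) -> Optional[Hashable]:
        """Add a packet; return the id of the dropped packet, if any."""
        self.q.append((packet_id, priority))
        if len(self.q) <= self.capacity:
            return None
        # Victim: largest priority value, newest among equals.
        idx = max(range(len(self.q)), key=lambda i: (self.q[i][1], i))
        victim = self.q[idx]
        del self.q[idx]
        return victim[0]

    def dequeue(self) -> Optional[Hashable]:
        """Remove and return the oldest queued packet id, or None if empty."""
        return self.q.popleft()[0] if self.q else None
```

Under this policy the number of drops is the same as with plain tail drop, but the drops are shifted onto low priority packets, which matches the observation in the text that the drop count at the router did not change.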

As a further illustration of the use of priority-based packet
dropping in the network with HPF, we ran a modified VCR
client-server application that sends MPEG-1 streams over HPF
[1] with and without priority-based packet dropping. The
modified VCR program uses reliable packets for control information
and unreliable packets with descending levels of priority for
I-frames, P-frames, and B-frames respectively.

We ran two experiments with a VCR client at atri and a VCR
server at radha, and introduced periodic UDP packet bursts from
tester to atri in order to cause congestion over the durga to atri
link. In the first experiment, durga ran a Round Robin packet
scheduler but without priority-based packet dropping. In this
case, we observed that the loss in the I-frames was 20%,
P-frames was 20%, and B-frames was 8%. This result is
intuitively justifiable, since I-frames and P-frames are larger than
B-frames, and the loss of even one packet in a frame led to the
loss of the entire frame (the client reads in frame mode). In
the second experiment, durga ran a Round Robin packet
scheduler with priority-based packet dropping. In this case, we
observed that the loss in the I-frames was 0%, P-frames was 0%,
and B-frames was 24%. This result was induced by the fact
that lower priority packets belonging to B-frames were
preferentially dropped during congestion. This simple test points to
the obvious usefulness of augmenting HPF at the end hosts with
network support in the form of priority-based packet dropping.
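
The remark that larger frames suffer more in frame mode can be made concrete with a small calculation. The sketch below assumes that packet losses are independent with the same probability, which is a simplification we introduce here (the bursty UDP cross traffic in the experiment need not behave this way), and the function name and example frame sizes are hypothetical.

```python
def frame_loss_probability(packet_loss: float, packets_per_frame: int) -> float:
    """Probability that at least one packet of a frame is lost."""
    return 1.0 - (1.0 - packet_loss) ** packets_per_frame
```

Under this assumption, a frame that spans four packets is lost roughly four times as often as a single-packet frame when the per-packet loss rate is small, since a frame-mode reader discards the whole frame if any one packet is missing.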

V. Conclusion 

In this paper, we have described three key features of HPF: 

• The ability of HPF to support heterogeneous packet flows, 
wherein different packets/frames may have different quality of 
service requirements in terms of reliability, priority, and dead- 
lines. 

• The ability of HPF to support application-level framing, and
provide APIs for applications to specify the priority, reliability
and timing requirements of each frame.

• The ability of HPF to propagate application-specified
priorities as hints for network routers to preferentially drop
low-priority packets during congestion.

We have implemented and used HPF for close to a year now.
From our experiences with HPF, we have observed the
following:

• HPF works fairly well without any network feedback, and
substantially improves the apparent network quality for loss
tolerant flows. HPF improves significantly with even minimal
network support. In particular, priority-based packet dropping in
the network augments the performance of HPF perceptibly.

• HPF can support delay-bounded packets without requiring
any clocks at the receiver. This is because of the 'mostly before
deadline' semantics of delay-bounded packet delivery in HPF,
and also because deadlines for HPF packets include the time for
the acknowledgement from the receiver to reach the sender. This
is a very attractive feature for supporting real-time flows in an
Internet environment.

• HPF can eliminate the problems associated with
synchronization of multiple streams of packets because it can support
heterogeneous packet flows, and it provides sequencing of such flows.

• We have found that in terms of performance, applications
which specify per-frame policy parameters to HPF and let HPF
do the adaptation consistently performed better than
applications that performed adaptation themselves - either running over
HPF or running over UDP. This is a direct consequence of the
fact that the adaptation mechanisms of HPF are kernel-level, and
hence more efficient.

Abstract: Combining hierarchical coding of data with receiver-driven control appears to be
an attractive scheme for the multicast transmission of audio/video flows in a heterogeneous 
multicast environment such as the Internet. However, little experimental data is available 
regarding the actual performance of such schemes over the Internet. Previous work such as 
that on receiver driven layered multicast uses join experiments to choose the best quality 
signal a receiver can subscribe to. In this paper, we present a receiver-based multicast rate 
control mechanism based on a recently proposed TCP-friendly unicast mechanism. We have 
implemented this mechanism and evaluate its performance in conjunction with a simple 
layered audio coding scheme. We find that it has interesting convergence and performance 
properties, but also bring out its limitations. 

Key-words: Audio coding, congestion control, hierarchical coding, Internet, multicast 
transmission 

Study and implementation of a hierarchical transmission scheme 
over the Internet 

Summary: Multipoint audio/video transmission in a heterogeneous environment 
such as the Internet raises many problems. A transmission scheme 
that combines hierarchical coding with receiver-oriented transmission control 
offers an elegant solution to the problem of receiver heterogeneity. However, 
little experimental data is available on the actual performance of such transmission 
schemes in the Internet. In one of the existing approaches, receiver-driven 
hierarchical transmission, the receiver chooses the number of layers it subscribes to 
on the basis of prior test subscriptions. In this paper, we describe a receiver-oriented 
multipoint transmission control mechanism based on a TCP-friendly mechanism 
recently proposed for point-to-point transmission. We have implemented 
this mechanism in conjunction with a hierarchical audio transmission scheme, and 
we evaluate here its performance and limitations. 

Key-words: Audio, hierarchical coding, congestion control, Internet, hierarchical 
multipoint transmission 

1 Introduction 

The transmission of real-time audio and video over the Internet has been much in the 
news recently. In particular, the transmission of voice has achieved high visibility because of 
technical, financial, and regulatory issues related to so-called "Internet telephones" (e.g. refer 
to [17, 38]), to the rapidly increasing number of companies selling or giving away Internet 
telephony software, and to the increasing traffic generated by these telephones. Even though 
most of the commercial companies involved in the Internet telephony business date the 
beginning of Internet telephony to the first release of VocalTec's "Internet Phone" [37] in 
1995, the transmission of unicast audio dates back to the 70's [39]*, while the transmission 
of multicast audio started "officially" at the March 1992 IETF [4]. However, it is true that 
only recently has audio and video traffic made up a non-negligible fraction of the traffic 
routed at some nodes in the Internet. In any case, this traffic is expected to increase for 
at least two reasons. First, a rapidly growing number of tools are available (some of them 
for free) that provide low bandwidth but decent to excellent quality audio and video coding 
[9]. Second, such tools are being incorporated in a growing number of applications such as 
collaborative working applications [9]. 

The wide use of unicast and multicast multimedia tools in the Internet raises two im- 
portant related questions. First, what is the impact of a much increased multimedia traffic 
on other traffic? Second, what kind of multimedia quality can users expect given that the 
network does not provide guarantees on delay, jitter, bandwidth, or loss rate? The impact of 
multimedia traffic on network and application performance is potentially quite large because 
the vast majority of currently available tools send data at a rate which does not depend on 
network state. For audio tools, this rate would be a constant rate corresponding to the 
chosen coding scheme (64 kb/s for G.711 coding) in the absence of silence detection. Such 
a behavior is unlike that of rate controlled applications, such as TCP-based applications, 
which adjust their output rate and hence their bandwidth requirements depending on the 
state of the network. Thus, non rate controlled applications are "bad citizens" since they 
share network resources with TCP-based applications in unfair ways, and their rapidly in- 
creasing use in the Internet might cause long-lived, severe congestion problems (especially 
for TCP users). 

There are two ways to prevent this from happening. One way is to replace the FIFO 
scheduling algorithms in Internet routers with RED-like algorithms that "punish" aggressive 
flows and reward rate controlled applications [13]. Another way is to incorporate in ap- 
plications TCP-friendly rate control mechanisms which make sure resources are not shared 
unfairly with TCP connections, and suitable for both unicast and multicast transmission. 
This latter way can be thought of as a special case (namely rate adaptation) of a more 
general approach which aims at adapting application behavior to network characteristics, 
the goal being to maximize the quality of the data delivered to the destinations. Other 
kinds of adaptation include adaptation to delay jitter by playout adjustment schemes [29] 
or adaptation to loss by ARQ- or FEC-based error control schemes [11, 33, 28]. 

* Also see the discussion on the AVT mailing list started on Sept. 16, 1996 by Ed Ellesson regarding a 
"Packetized Audio Patent". 

The second way above (namely incorporating rate control mechanisms in applications) 
does not require modifications to router software and hence can be implemented in the 
current Internet. However, none of the commercial tools we are aware of uses any kind of 
rate control*. This is not so surprising after all since non rate controlled applications tend 
to "steal" bandwidth from rate controlled applications, and thus provide better quality to 
their users (this will be illustrated in Section 4). This makes it all the more important that 
all applications be rate controlled since the current Internet does not provide incentives to 
behave in a fair way. 

Unfortunately, the design of rate control schemes for multicast real time applications is 
not a simple extrapolation of the TCP-like source based control schemes used by unicast 
data applications for two main reasons. First, source based rate control schemes using 
feedback information about the state of the network do not scale well with the number of 
participants in the multicast group, implying that the rate of exchange of any information 
between participants needs to be carefully controlled [32, 3]. Second, source based schemes 
do not scale well with the level of heterogeneity in the branches of the multicast tree and/or 
in the participants. This is because the source can adjust the rate of a stream to match 
the requirements of one participant (i.e. to adapt to the state of one branch in the tree), 
but it cannot match the conflicting requirements of multiple heterogeneous participants (i.e. 
adapt at the same time to different states of different branches). Various approaches to this 
problem have been described in the literature, using gateways [27, 1], simulcast transmission 
[7], and layered transmission coupled with layered coding [22, 23]. 

We focus on the layered transmission / layered coding approach in this paper. While this 
approach is attractive and has been advocated for some time now, it has not yet been de- 
ployed over the Internet (although this is expected to change real soon now). The approach 
relies on the availability of two components, a layered coder and a layered transmission 
scheme. With layered coding, source data is encoded into a number of layers, or (sub)bands, 
that can be combined to reconstruct a signal that gets closer to the original signal as the 
number of combined layers increases. Layered coders for audio and video have been de- 
veloped over the past few years, but few coders have low enough CPU requirements as to 
be useful in software only tools. Recent work has produced such low CPU layered video 
coders [23, 5]. However, we are aware of little equivalent work for audio coders. Thus, we 
have developed simple layered schemes for audio based on the idea of temporal subsampling, 
which are described in Section 2. 

The idea of layered transmission then is to send each layer on a separate multicast 
group, with receivers deciding to join/quit a group (i.e. to receive/drop another layer) on 
their own. The basic principles of layered transmission have been known for a while (see 
references in [22]). However, a specific scheme specifying how/when receivers are to join 
and quit layers has only recently been described and simulated. This scheme, referred to as 
Receiver driven Layered Multicast (RLM), uses "probing" experiments similar to those used 
by TCP, to decide when to join and quit layers. We propose a layered transmission scheme in 
which RLM's join experiments are replaced by an explicit estimation at each receiver of the 
bandwidth that would be used by an equivalent TCP connection (the notion of equivalent 
connection will be defined later) between the source and the receiver. This estimation is 
done using the same technique as that recently described to design a TCP-friendly unicast 
rate control scheme [20]. Thus, we obtain a TCP-friendly receiver-based multicast rate 
control mechanism. We have implemented this scheme and evaluated its performance and 
limitations on the MBone. 

* It is hard to blame the commercial tools for behaving so as they only mimic research tools in this 
respect. None of the more popular MBone tools such as VAT [35] uses any kind of rate control either. 

The rest of the paper is organized as follows. In Section 2, we describe simple layered 
audio coding schemes. In Section 3, we describe the receiver-based control scheme. In 
Section 4, we describe the experimental settings and discuss the results. Section 5 concludes 
the paper. 

2 Simple layered audio coding schemes 

Layered coding is a family of signal representation techniques in which the source information 
is partitioned into sets called layers. The layers are organized so that the lowest, or 
base layer, contains the minimum information for intelligibility. The other layers, called 
complementary layers, contain "add-on" information which improves the overall quality of 
the signal. Usually, layered encoding schemes are organized so that some layers (in particular 
the base layer) are mandatory to reconstruct a coherent signal. Using such schemes then 
requires either that the net be able to discriminate between packets and provide packets 
that carry important information with a guaranteed performance (in particular guaranteed 
maximal loss rate) service, or that these packets be "protected" against loss. This can be 
done using a variety of so-called unequal error protecting schemes, which have been the 
subject of much research effort recently (e.g. [16, 10]). 

Our goal was not to develop or experiment with such schemes, but instead to experiment with 
rate control schemes. Thus, we have focused on simple coders with low cpu requirements and 
no need for unequal error protection schemes. We describe next a simple layered encoding 
scheme in which all layers have the same importance, meaning that the loss of information 
from one layer does not impact quality more than the loss of information from another layer. 

2.1 The basic subsampling scheme 

The simplest balanced scheme one can think of is based on straight subsampling. The coding 
algorithm is based on a temporal subband decomposition. Consider for example the case of 
a PCM (Pulse Code Modulation) encoded 8 kHz audio signal decomposed into 3 PCM 2.7 
kHz flows as shown in Figure 1. The temporal decomposition algorithm is carried out at the 
source for each audio chunk (which typically includes 20 ms, 40 ms or 80 ms of audio). 
At a destination, a receiver which receives all 3 flows can retrieve the original input signal. 
If one or two flows are missing, the destination uses the samples received to reconstruct an 
approximation of the original signal. An upsampling from one to three flows is shown in Figure 2. 
Clearly, the larger the number of received flows, the better the quality of the reconstructed 
signal. 

Figure 1: Hierarchical coding scheme 

Figure 2: Reconstruction 1 to 3 
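
The basic subsampling scheme can be illustrated with a minimal Python sketch. It assumes that flow k simply carries every third sample starting at offset k, and that a receiver fills missing samples by repeating the nearest earlier received sample; the function names and the hold-last-value reconstruction are illustrative assumptions, not the filter actually used by the authors.

```python
def decompose(samples, num_flows=3):
    """Split one audio chunk into num_flows interleaved, subsampled flows."""
    return [samples[k::num_flows] for k in range(num_flows)]


def reconstruct(flows, chunk_len):
    """Rebuild a chunk from the flows that arrived (None marks a missing flow).

    Missing samples are filled by repeating the nearest earlier received
    sample (hypothetical hold-last-value upsampling, chosen for simplicity).
    """
    num_flows = len(flows)
    out = [None] * chunk_len
    # Put every received sample back at its original position.
    for k, flow in enumerate(flows):
        if flow is not None:
            out[k::num_flows] = flow
    received = [v for v in out if v is not None]
    if not received:
        raise ValueError("no flow received for this chunk")
    # Fill the holes left by missing flows.
    last = received[0]
    for i, value in enumerate(out):
        if value is None:
            out[i] = last
        else:
            last = value
    return out


chunk = list(range(12))
flows = decompose(chunk)
full = reconstruct(flows, len(chunk))
approx = reconstruct([flows[0], None, None], len(chunk))
```

With all three flows the chunk is recovered exactly; with a single flow each received sample is repeated three times, which is a coarse approximation of the original signal.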



Note that temporal decomposition handles signals sampled with different sampling rates. 
For example, a 48 kHz audio signal can be decomposed into three 16 kHz subflows, each 
of which can be decomposed into two 8 kHz subflows; splitting each 8 kHz subflow into 
three 2.7 kHz flows as above finally yields eighteen 2.7 kHz audio layers. Of course, 
temporal subband decomposition can be coupled with compression schemes*. 



2.2 Robustness against packet loss 

The audio coding scheme described above has the interesting balanced property mentioned 
earlier: all the layers have the same importance in the network. However, in case of severe 
congestion, the congestion control algorithm may choose to subscribe to only one flow and, 
in this case, the coding has no mechanism at all to reconstruct lost samples of audio. To 
achieve robustness against packet loss, the receiver could decide to subscribe to at least 2 
flows. We have decided to add to the base flow of each sample, the flow L corresponding to 
the previous sample (L is the number of layers of the coding). So, in such a scheme, the first 
layer has twice the bandwidth of other layers and the transmission control scheme handles 
(L - 1) flows instead of L. Note that contrary to other redundancy schemes, there is no 
bandwidth wasted when packet loss is null since the "redundancy" information appended to 
the first layer is always used to improve the reconstruction of the sample. 

* Refer to the Web page http://www.inria.fr/rodeo/turletti/audio/ to retrieve samples showing the impact 
on quality of the number of received layers, and of the compression scheme used to encode the layers. 

Note also that the redundancy added to flow #1 does not imply that this flow has a 
higher priority, since flow #1 does not need to be received correctly in order to decode the other flows. 
Although no flow is more important than another, all receivers have to adopt the 
same order in the subscription algorithm (in order to perform efficient pruning to limit the 
bandwidth sent). 
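
One possible reading of this packaging rule is sketched below in Python. It assumes that the first transmitted layer carries flow 1 of the current chunk together with flow L of the previous chunk, and that layers 2 to L-1 carry their own flow unchanged; the dictionary layout and the name `pack_layers` are hypothetical.

```python
def pack_layers(prev_flows, cur_flows):
    """Return the L-1 transmitted layers for the current chunk.

    prev_flows / cur_flows: lists of L subsampled flows (see Section 2.1).
    Layer 1 = flow 1 of the current chunk + flow L of the previous chunk,
    so it needs roughly twice the bandwidth of the other layers.
    """
    delayed_last = prev_flows[-1] if prev_flows is not None else []
    first_layer = {"current": cur_flows[0], "previous_last": delayed_last}
    # Flows 2 .. L-1 travel alone; flow L of this chunk goes out with the
    # next chunk's first layer.
    other_layers = [{"current": flow} for flow in cur_flows[1:-1]]
    return [first_layer] + other_layers


previous = [[0, 3], [1, 4], [2, 5]]
current = [[6, 9], [7, 10], [8, 11]]
layers = pack_layers(previous, current)
```

Under this reading a receiver subscribed only to the first layer always holds two of the L flows of the previous chunk, which is what lets it interpolate over lost samples.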

2.3 CPU and bandwidth cost 

The CPU cost of the hierarchical coding algorithm mainly depends on the compression scheme 
used following the temporal subband decomposition. In our experiments, we have successfully 
tested the PCM, ADM6, ADM5, ADM4, and ADM3 compression schemes, which have 
very low CPU requirements [14]. More efficient compression coders could also fit into the 
temporal subband decomposition described above. However, very high compression coding 
algorithms do not appear to fit well with layered coding over the Internet because the overhead 
of the IP, UDP and RTP [32] headers significantly decreases their bandwidth savings. 

Table 1 shows the bandwidth overhead for layered and non-layered audio coding using 
PCM and ADM4 compression schemes. The bandwidth overhead of the IP/UDP/RTP headers 
is shown for three sizes of packets corresponding to 20 ms, 40 ms and 80 ms of compressed 
speech. 

Interactivity needed by audio/video conferencing applications is often cited as requiring 
low packetization intervals of 20ms. However, such a low packetization interval increases the 
number of packets sent per second, especially if several layers are used. Then the bandwidth 
requirement of the IP/UDP/RTP headers may be higher than the bandwidth requirement of 
the actual payload, see Table 1. For example, with a 20 ms packetization interval, the 
bandwidth corresponding to the headers for 6 flows is equal to 106 kbps with IPv4 and 144 kbps 
with IPv6. Table 2 recalls the size of IPv4, IPv6, UDP and RTP headers. 
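
The header overhead figures quoted above follow directly from the per-packet header sizes of Table 2. The short Python sketch below reproduces the arithmetic; the function name and dictionary are illustrative, and the header sizes are the ones given in Table 2 (24 bytes for IPv4, 40 for IPv6, 8 for UDP, 12 for RTP).

```python
# Per-packet header sizes in bytes, as listed in Table 2.
HEADER_BYTES = {"IPv4": 24, "IPv6": 40, "UDP": 8, "RTP": 12}


def header_overhead_kbps(ip_version, packet_interval_ms, num_flows=1):
    """Bandwidth consumed by IP/UDP/RTP headers alone, in kbps."""
    bits_per_packet = 8 * (
        HEADER_BYTES[ip_version] + HEADER_BYTES["UDP"] + HEADER_BYTES["RTP"]
    )
    packets_per_second = 1000.0 / packet_interval_ms
    # One packet per flow per packetization interval.
    return num_flows * bits_per_packet * packets_per_second / 1000.0


one_flow_v4 = header_overhead_kbps("IPv4", 20)      # 17.6 kbps
six_flows_v4 = header_overhead_kbps("IPv4", 20, 6)  # 105.6 kbps, i.e. about 106
six_flows_v6 = header_overhead_kbps("IPv6", 20, 6)  # 144.0 kbps
```

Doubling the packetization interval halves the overhead, which is why the 40 ms and 80 ms columns of Table 1 look so much more favourable for layered coding.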

However, experience with audioconferencing tools in the MBone shows that 40 ms and 
80 ms are convenient values especially if redundancy techniques [28] are used to minimize 
the effect of large "holes" in the speech. 80 ms appears to be a good compromise between 
interactivity needs and bandwidth overhead generated by the hierarchical coding. 

In some cases (e.g. very low bandwidth links), the IP/UDP/RTP headers may be compressed 
to reduce the bandwidth overhead [8]. However, such techniques increase the CPU 
consumption at the routers and they are expected to be used on point to point links, but 
not on an end-to-end basis on the MBone. 

                              20 ms buffer             40 ms buffer             80 ms buffer
Coding/            Payload    Total      Efficiency    Total      Efficiency    Total      Efficiency
Transmission       (kbps)     (kbps)                   (kbps)                   (kbps)
                              IPv4/IPv6  IPv4/IPv6     IPv4/IPv6  IPv4/IPv6     IPv4/IPv6  IPv4/IPv6
-----------------------------------------------------------------------------------------------------
PCM                64         81.6/88.0  3.64/2.67     72.8/76.0  7.27/5.33     68.4/70.0  14.55/10.67
ADM4               32         50.1/56.5  1.82/1.33     40.8/44.0  3.64/2.67     36.4/38.0  7.27/5.33
PCM 1 fl. 2.7kHz   21.3       38.9/45.3  1.21/0.89     30.1/33.3  2.42/1.77     25.7/27.3  4.84/3.55
PCM 3 fl. 2.7kHz   64         117/136    1.21/0.89     90.3/100   2.42/1.77     77.7/81.9  4.84/3.55
ADM4 1 fl. 2.7kHz  10.7       28.3/34.7  0.61/0.45     19.5/22.7  1.22/0.89     15.1/16.7  2.43/1.78
ADM4 3 fl. 2.7kHz  32         84.9/104   0.61/0.44     58.5/68.1  1.22/0.89     45.3/50.1  2.43/1.78
Overhead (1 fl.)   N/A        17.6/24.0  N/A           8.8/12.0   N/A           4.4/6.0    N/A
Overhead (3 fl.)   N/A        52.8/72.0  N/A           26.4/36.0  N/A           13.2/18.0  N/A
Overhead (6 fl.)   N/A        106/144    N/A           52.8/72.0  N/A           26.4/36.0  N/A

Efficiency = payload / header.

Table 1: Bandwidth overhead for several audio coding/transmission schemes 



Overhead per packet    IP v4    IP v6    UDP    RTP
(in bytes)             24       40       8      12

Table 2: IP/UDP/RTP header size 



2.4 Implementation details 

The hierarchical encoded flows are carried as payload data within the RTP protocol [32]. The 
application follows the recommendations defined in the "RTP usage with layered multimedia 
streams" draft [35]. There is no need to manage a different sequence numbering per flow. 
So, in our hierarchical audio application, we further impose that all flows use the same 
sequence number to encode an audio sample. The receiver handles one circular buffer for 
all the layers, so only one playout is computed using the algorithms described in [29]. 

3 The receiver-based control scheme 

3.1 The RLM scheme 

Receiver driven Layered Multicast (RLM) is the first published scheme which described a 
specific multicast layered transmission scheme and the associated receiver control scheme. 
RLM uses "probing" experiments similar to those used by TCP to decide when to join and 
quit layers. Specifically, when a receiver detects congestion, it quits the multicast group 
corresponding to the highest layer it is receiving at the time (we say that the receiver 
drops the highest layer); when a receiver detects spare capacity in the network, it joins the 
multicast group corresponding to the layer next to the highest layer received at the time 
(we say that the receiver adds the next layer). 

The receiver detects network congestion when it observes increasing packet losses. In 
the absence of loss, the receiver estimates spare capacity, or rather the existence of spare 
capacity, with so-called join experiments. A join experiment means that a receiver joins the 
next group and measures the loss rate over an interval referred to as the decision time (to 
avoid the synchronization of join experiments, experiments are carried out at randomized 
times). However, the load created by join experiments increases as the size of the multicast 
group increases. To prevent a load explosion, the authors in [22] use "shared learning", 
in which a receiver about to start a join experiment multicasts its intent to the group. 
Thus all receivers get to know the result of join experiments carried out by other receivers. 
To prevent join experiments for different layers from interfering with each other, only receivers 
that are doing join experiments for layers equal to or below a newly advertised experiment 
can actually conduct their own experiments. RLM with shared learning scales with the 
size of the group, but it raises a few questions. In particular, it is not clear how effective 
shared learning is in the absence of knowledge about the structure of the multicast delivery 
tree. Furthermore, shared learning increases the convergence time to steady state layer 
subscription especially for receivers with spare capacity (since they will have to wait for 
slow receivers to reach their steady state before they can join additional upper layers). Also, 
the times at which layers are added and dropped determine the rate increase and decrease 
along the branches of the tree. Thus, RLM should choose these times appropriately if it 
wants to achieve fairness with TCP traffic. However, this appears hard to do in practice. 

3.2 Basic idea 

Our control scheme aims at providing the benefits of the RLM scheme without (at least 
some of) the costs associated with it. The scheme uses the same idea of a receiver based 
rate control scheme, however join experiments are replaced by an explicit estimation at each 
receiver of the bandwidth that would be used by an equivalent TCP connection between 
the source and the receiver. This estimation is done using the same technique as that 
recently described to design a TCP-friendly unicast rate control scheme [20]. Specifically, the 
bandwidth $\lambda_{equ}$ of a TCP connection can be well represented by Equation 1 below [20, 21, 25]: 

\[ \lambda_{equ} = \frac{1.22 \times MTU}{RTT \times \sqrt{Loss}} \qquad (1) \]

where MTU is the maximum packet size used on the connection, RTT is the mean round 
trip time, and Loss is the mean packet loss rate. Now assume that each receiver knows 
the rate $\lambda_i$ of the layer i flow generated by the source over the multicast tree. Then, each 
receiver executes the following algorithm: 

Step 0: Upon joining the group, subscribe to the base layer 

Step 1: Measure or estimate MTU, RTT, and Loss 

Step 2: Compute $\lambda_{equ}$ 

Step 3: Find L, the largest integer such that $\sum_{i=1}^{L} \lambda_i \le \lambda_{equ}$ 

Note that the rate at which data flows between the source and any receiver is equivalent 
to that of a TCP connection running over the same path. Thus, we have obtained a TCP-friendly 
scheme suitable for multicast delivery. Furthermore, note that the scheme does not 
rely on active probing schemes such as join experiments, nor does it require exchange of 
information between participants as is done with shared learning. We examine below details 
associated with steps 1 and 3 above (e.g. the estimation of parameters in step 1). However, 
before doing this, it is worth going back to Equation 1 and noting that i) the equation in fact 
only provides an upper bound to the rate of an equivalent TCP connection, and ii) 
it has nevertheless been shown to fit well with measured and simulated performance of a wide variety of 
TCP variants [13]. 
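
Steps 1 to 3 can be sketched in a few lines of Python. The sketch assumes the usual form of the TCP-friendly rate formula with the constant 1.22, an MTU expressed in bytes (hence the factor 8 to obtain bit/s), and layer rates given in bit/s; the function names are illustrative only.

```python
import math


def tcp_equivalent_rate(mtu_bytes, rtt_s, loss):
    """Rate of an equivalent TCP connection in bit/s (assumed form of Equation 1)."""
    if loss <= 0.0:
        return math.inf  # no observed loss: the formula imposes no bound
    return 1.22 * mtu_bytes * 8 / (rtt_s * math.sqrt(loss))


def layers_allowed(layer_rates_bps, equivalent_rate_bps):
    """Largest L such that the first L layers together fit in the equivalent rate.

    Returns 0 when even the base layer exceeds the equivalent rate; the
    caller decides what to do in that case.
    """
    total = 0.0
    count = 0
    for rate in layer_rates_bps:
        total += rate
        if total > equivalent_rate_bps:
            break
        count += 1
    return count


rate = tcp_equivalent_rate(mtu_bytes=576, rtt_s=0.1, loss=0.01)
num_layers = layers_allowed([32000] * 6, 100000)
```

For example, with six 32 kb/s layers and an equivalent rate of 100 kb/s, the receiver may subscribe to three layers (96 kb/s), since a fourth would bring the cumulative rate to 128 kb/s.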

3.3 Parameter estimation 

Step 1 in the list above deals with estimating network parameters. So let us consider all of 
them in turn. 

MTU: The MTU value may be set to the minimum acceptable value for TCP of 576 
bytes or could be determined using the MTU discovery algorithm [24]. 

Loss: The mean loss rate Loss is easily computed from observed losses (detected using the 
RTP sequence number field) using an exponential filter with a half life equal to n packets 
(or equivalently to d * n seconds, where d denotes the inter-packet arrival time). The trick of course 
is to choose an appropriate value for n. We use a value that varies over time as shown in [22]. 
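
A minimal Python sketch of such a loss estimator follows. It assumes a fixed half life n (the text uses a time-varying one), a per-packet loss indicator of 0 or 1 fed to the filter, and gaps detected from the 16-bit RTP sequence number; the class and method names are hypothetical, and late or duplicate packets are simply ignored.

```python
class LossEstimator:
    """Exponentially filtered loss rate with a half life of n packets."""

    def __init__(self, half_life_packets):
        # After n updates the weight of an old observation is halved.
        self.decay = 0.5 ** (1.0 / half_life_packets)
        self.loss = 0.0
        self.expected_seq = None

    def _update(self, lost):
        self.loss = self.decay * self.loss + (1.0 - self.decay) * lost

    def on_packet(self, seq):
        """Feed the RTP sequence number of a received packet; return the estimate."""
        if self.expected_seq is not None:
            gap = (seq - self.expected_seq) & 0xFFFF  # 16-bit wrap-around
            if gap >= 0x8000:
                return self.loss  # late or duplicate packet: ignored here
            for _ in range(gap):
                self._update(1.0)  # one update per missing packet
        self._update(0.0)          # the packet that just arrived
        self.expected_seq = (seq + 1) & 0xFFFF
        return self.loss


estimator = LossEstimator(half_life_packets=1)
estimator.on_packet(0)
estimate = estimator.on_packet(2)  # packet 1 was lost
```

A small n reacts quickly to bursts of loss but yields a noisy estimate, while a large n smooths the estimate at the cost of slower reaction, which is why the choice of n matters.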

RTT: The mean round trip delay RTT cannot be computed as easily as Loss since the 
information received from the source in RTP SRs (Sender Reports) only refers to the one-way 
delay between source and receiver. Unfortunately, the one way delay can be a poor estimate 
of RTT/2 [6, 26]. There are 3 ways to tackle this problem. The first way is to estimate RTT 
using twice the delay from source to receiver (which is known using RTP timestamps), and 
assume a symmetric path. However, this clearly breaks down, in particular with asymmetric 
links (such as satellite links). The second way is to augment the capabilities of RTP so as 
to be able to estimate the actual round trip delay. This can be done as follows: the receiver 
sends an rtt-request packet; the sender replies immediately with an rtt-reply packet to the 
receiver; the receiver notes the delay corresponding to the current RTT value and updates 
the RTT estimator with it. Unfortunately, this scheme does not scale to large multicast 
sessions. First, the frequency at which the receiver generates the rtt-request packets should 
be a function of the session size (as with RTCP messages [32]) in order to avoid the well-known 
feedback implosion problem. Second, the period between two rtt-request packets should not 
be constant in order to avoid RTT synchronized requests and periodic congestion [12]. Third, 
unicast rtt-requests from the sender may be less efficient than multicast joint rtt-requests. 
We could have used the SR and RR (Receiver Report) packets to append the information 
required to estimate RTT. However, these packets include several parameters that do not 
need to be sent as often as the RTT information. So, in order to save bandwidth, we 
have created two new RTCP packets which include the minimal information needed to 
estimate RTT: rttreq for the rtt-request packets and rttrep for the rtt-reply packets sent by 
the sender. rttreq packets include a 4-byte RTCP header which identifies the packet and 
encodes the packet length, the SSRC of the receiver, and the time $ts_{rttreq}$ at which the rtt-request 
has been sent at the receiver. rttrep packets include the 4-byte RTCP header, the 
SSRC of the sender and a set of RTT information report blocks corresponding to each rttreq 
packet received since the last rttrep report. Each RTT information report block includes the 
SSRC of the receiver, the corresponding $ts_{rttreq}$ parameter and the delay $\delta$ in milliseconds 
between the time at which the rttreq has been received at the sender and the time at which 
the rttrep is sent to the group, see Figure 3. 



Figure 3: rttreq and rttrep packets 

As soon as a receiver receives an rttrep report, it notes the corresponding receiving time 
$tr_{rttrep}$. Then, if its SSRC is included in the packet, it is able to update the estimation of 
the RTT with the new value: 

\[ RTT = tr_{rttrep} - ts_{rttreq} - \delta \qquad (2) \]



Note that this mechanism does not assume that sender and receivers clocks are synchronized. 
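
The reason no clock synchronization is needed can be seen in a small Python sketch: both timestamps in the RTT computation are read on the receiver's clock, while the holding delay is a difference of two readings of the sender's clock. The simulation below uses hypothetical delays and an arbitrary sender clock offset; all names are illustrative.

```python
def rtt_from_report(tr_rttrep_ms, ts_rttreq_ms, delta_ms):
    """RTT = receive time of rttrep - send time of rttreq - sender holding delay."""
    return tr_rttrep_ms - ts_rttreq_ms - delta_ms


def simulate(forward_ms, backward_ms, hold_ms, sender_clock_offset_ms):
    """Run one rttreq/rttrep exchange and return the RTT seen by the receiver."""
    ts_rttreq = 1000.0  # receiver clock: rttreq sent
    # Sender clock (shifted by an unknown offset): rttreq arrives, rttrep leaves.
    arrival = ts_rttreq + forward_ms + sender_clock_offset_ms
    departure = arrival + hold_ms
    delta = departure - arrival  # the offset cancels out here
    # Receiver clock: rttrep arrives.
    tr_rttrep = ts_rttreq + forward_ms + hold_ms + backward_ms
    return rtt_from_report(tr_rttrep, ts_rttreq, delta)


synchronized = simulate(40.0, 60.0, 25.0, sender_clock_offset_ms=0.0)
skewed = simulate(40.0, 60.0, 25.0, sender_clock_offset_ms=123456.0)
```

Both runs return the same 100 ms, the sum of the forward and backward delays, even though the path is asymmetric and the sender's clock is far off in the second run.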

Let us now examine how this parameter estimation mechanism scales with the number 
of receivers N in the session. To do this, we compute next the interval of time $T_{RTT}$ 
between two RTT estimations for a receiver in the session. Assume that r % of the payload 
bandwidth BP is reserved for this usage, and denote this reserved bandwidth by $BP_{rtt}$. Then 

\[ BP_{rtt} = \frac{W_{rttrep} + N \cdot W_{rttreq}}{T_{RTT}} \qquad (3) \]

with $W_{rttrep}$ the size of the rttrep packet and $W_{rttreq}$ the size of the rttreq packet. From 
Figure 3, we have $W_{rttrep} = 64 + 96 N$ bits and $W_{rttreq} = 96$ bits. So, using Equation 3: 

\[ T_{RTT} = \frac{64 + 192 N}{BP_{rtt}} \qquad (4) \]

Figure 4 shows the delay between two RTT updates for N in [1, 200] with the following 
parameters: BP = 64 kb/s, r = 5% (that is, $BP_{rtt}$ = 3.2 kb/s). With N = 16 receivers, the RTT estimate is 
updated about every second. When the number of receivers reaches 100, the interval of time 
increases to 6 seconds. Clearly our scheme is not convenient for large groups since it suffers 
from the same scaling problems as RTCP [31]. All the more so since some of the rttreq 
and rttrep packets can be lost during transmission. 

Figure 4: Delay between two RTT estimations (in sec) vs N 
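
The scaling behaviour can be checked numerically with the Python sketch below. The packet sizes are those stated in the text (64 + 96 N bits for rttrep, 96 bits for rttreq); the default payload bandwidth of 64 kb/s with 5% reserved is an assumption, chosen because it reproduces the intervals of about 1 s and 6 s quoted for 16 and 100 receivers.

```python
def rtt_update_interval_s(num_receivers, payload_bps=64000, reserved_fraction=0.05):
    """Interval between two RTT estimations for one receiver, in seconds."""
    bp_rtt = reserved_fraction * payload_bps  # bandwidth reserved for rttreq/rttrep
    w_rttrep = 64 + 96 * num_receivers        # one report block per receiver
    w_rttreq = 96
    # Each round costs one rttrep plus one rttreq from each receiver.
    return (w_rttrep + num_receivers * w_rttreq) / bp_rtt


interval_16 = rtt_update_interval_s(16)    # about 0.98 s
interval_100 = rtt_update_interval_s(100)  # about 6.02 s
```

The interval grows linearly with N, so a session with 200 receivers would refresh its RTT estimate only about every 12 seconds under these assumptions.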

This has led us to examine a third way to estimate RTT, which in fact attempts to 
bypass RTT entirely in Equation 1. The goal there is to obtain a relation for $\lambda_{equ}$, in ways 
similar to those described in [25, 19], that does not involve RTT, but rather the source to 
destination delay and the inter-packet jitter observed at the destination. The idea is to 
measure RTT infrequently (e.g. using the scheme above), then to track variations in the 
one-way source-destination delay using RTP timestamps, and to track variations in the one-way 
destination-source delay using the jitter at the destination. Indeed, variations in the 
destination-source delay are translated at the source into variations in the times at which packets 
are sent and thus into jitter at the destination. Unfortunately, we have not been able to 
derive an approximate formula for $\lambda_{equ}$ yet, and this remains an area for future research. 

A fourth approach, which we only mention, is to forget about TCP-friendliness altogether 
and instead wait for schemes such as RED to be deployed in the Internet.... 

3.4 Control algorithm 

At the beginning, when a receiver joins a session, it subscribes to the base layer in order 
to estimate the parameters as described above during a time interval $T_i$. Once parameters 
have been estimated, Equation 1 is used to estimate the rate ($\lambda_{equ}$) of an equivalent TCP 
connection. The value of $\lambda_{equ}$ in turn indicates how many layers the algorithm may subscribe 
to so as to behave (at least throughput wise) like a TCP connection. At initialization, 
the $T_i$ time interval is set to the length corresponding to the exponential filters (for RTT 
and Loss parameters). Then, we ensure that the $T_i$ interval is at least as large as the 
estimated mean RTT value so that the receiver can detect the impact of the 
latest action performed. 

A receiver actually subscribes to extra multicast group(s) when it is allowed to do so. 
Once this is done, it resumes the observation and estimation of the parameters so as to 
determine the impact of the join on network congestion. The receiver won't take any actions 
until it can estimate the effect of its latest action. This period of time is equal to the maximum 
of the estimated RTT and the delay corresponding to the filter length. 

The situation is similar, in the opposite direction, when the equivalent rate tells the
receiver to drop a layer. Finally, if the equivalent rate is less than the minimal rate the
application can send (i.e. the base band rate), the algorithm should pop up a message to
the receiver and invite (or force) him to leave the session in order not to congest the
network and swamp all the network resources. Internet collapse could happen, for example,
if a large number of receivers continued to subscribe to the minimal quality flow (i.e. with no
further rate decrease when high packet loss is observed). To follow the behavior of TCP [18],
the application can wait for a delay T_r before probing the network state by joining the first
layer: T_r = 2^k + a, where k is the number of consecutive failed attempts and a is a random
value added to avoid synchronization of receivers.
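
The retry delay can be sketched as follows. The unit (seconds) and the range of the random term are assumptions, since the text does not specify them; `retry_delay` and `jitter_max` are hypothetical names.

```python
import random


def retry_delay(failed_attempts, jitter_max=1.0, rng=random):
    """Exponential retry delay: 2**k plus a random offset.

    failed_attempts is k, the number of consecutive failed attempts.
    jitter_max bounds the random term (an assumed range, not from the paper).
    """
    if failed_attempts < 0:
        raise ValueError("failed_attempts must be non-negative")
    return 2 ** failed_attempts + rng.uniform(0.0, jitter_max)
```

Doubling the deterministic part after each failure mirrors TCP's timer backoff, while the random term keeps receivers that failed together from probing together.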

4 Experimental evaluation over the MBone 

The layered codec and the control scheme described above have been implemented in an
experimental version of the FreePhone audio tool [15]. Packets are sent using the usual
IP/UDP/RTP stack. All the measurement results presented in the paper have been obtained
with a hierarchical and a non-hierarchical ADM4 coder. Audio packets sent by the coder
include speech samples corresponding to 80 ms of speech. They are sent at regular 80 ms
intervals while the audio source is active.

We have carried out measurements over several connections including local connections
(between machines located on the local area networks at the sender side) as well as wide-area
connections. In the long distance experiments over the MBone, the source is at MIT
(Massachusetts Institute of Technology) and the receivers are at MIT, INRIA (in France), UCL
(University College London in the UK), and UTS (University of Technology in Australia).

The MBone delivery tree between the source and the destinations is shown in Figure 5.



Figure 5: MBone delivery tree between sender at MIT and receivers

During experiments with non-controlled rate flows, we have restricted the number of
layers to 3 in order to limit congestion in the network. The experiments done
with the congestion control scheme enabled are allowed to use more flows.



4.1 Influence of the shift parameter 

In the following two experiments, we examine the characteristics of the data received on each
flow, according to the way they have been sent at the sender side (per burst or at regular
time intervals, i.e. using the shift parameter). In the first experiment, packets are sent per
burst in the 3 flows at each 80 ms interval, whereas in the second experiment, the packets
from flows 2 and 3 are delayed in such a way that the overall transmission is periodic. In
both experiments, a total of 32 kbps of payload (corresponding to an ADM4 8 kHz flow) has
been split into 3 flows sent in 3 different multicast groups during 7 minutes (which roughly
corresponds to 5000 packets).

Table 3 shows the values of RTT (in ms) and loss (in %) corresponding to the 3 receivers.
We note that in the second experiment, the values of RTT and loss measured at a receiver
are essentially constant. However, in the first experiment, and especially when the network
is loaded, RTT and loss increase with the flow number (see footnote 5). To explain this phenomenon, we
can use the cumulated loss (CL) rate parameter, which represents the probability of losing
all the packets (from different flows) corresponding to a sample. The cumulated loss rate is
computed at the receiver side, after reconstructing all samples with packets received from
all the flows. The results are shown in Table 3 for the two experiments. The cumulated
loss rate for the 3 flows received is about eight times lower in the second experiment (with
shift) than in the first one (per burst). This means that the higher packet loss rate observed
in the first experiment is mainly due to the loss of consecutive packets sent. The low value obtained
in the second experiment (less than 0.5 %) clearly shows that this hierarchical coding scheme
efficiently uses each flow as redundancy for the other flows received. We have plotted
in Figures 6, 7 and 8 the evolution of the cumulated loss rate versus the sequence number
(or sample number) for the INRIA, UCL and UTS receivers respectively. The
left and right figures show the cumulated packet loss rate when the packets are sent
per burst and at regular time intervals (i.e. with the shift option enabled) respectively. We can
observe that at each receiver, the probability of losing 3 consecutive packets (i.e. a sample
sent in 3 flows) is very low.
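
To make the two loss measures concrete, here is a minimal sketch of how per-flow loss and cumulated loss could be computed from a reception log. The data layout (`received[f][s]`) and the function name are assumptions made for illustration; they are not taken from the FreePhone implementation.

```python
def loss_rates(received):
    """Compute per-flow and cumulated loss rates, both in percent.

    received[f][s] is True when the packet of flow f carrying sample s
    arrived. The cumulated loss for the first n flows counts the samples
    for which none of those n packets arrived.
    """
    n_samples = len(received[0])
    per_flow = [100.0 * row.count(False) / n_samples for row in received]
    cumulated = []
    for n_flows in range(1, len(received) + 1):
        lost_everywhere = sum(
            1
            for s in range(n_samples)
            if not any(received[f][s] for f in range(n_flows))
        )
        cumulated.append(100.0 * lost_everywhere / n_samples)
    return per_flow, cumulated
```

With a single flow the cumulated loss equals the plain loss rate, which is why the two columns coincide for flow 1 in Table 3.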

So the sender's application needs to include a mechanism to avoid sending bursts of
packets in different flows (in the same way an efficient non-hierarchical sender's application
does). Note that the periodic packet loss observed at UTS (Figure 8 shows three consecutive
packet losses roughly every 370 packets, which corresponds to 30 seconds) is likely due
to the router synchronization problem, in which routers periodically exchange their routing tables
over the network [12].

Table 3 also shows that by using the shift, the mean RTT value and the mean packet
loss rate are nearly constant across the three flows. This is good for the scalability
of the scheme, since we can choose a single flow to estimate the parameters for the receiver's
congestion control algorithm. In the rest of the paper, the RTT and packet
loss rate parameters are estimated on the base flow, and experiments are carried out with the shift
option enabled.

4.2 Evaluation of the congestion control algorithm 

Having studied the correlations between flows on the MBone, we can now evaluate the congestion
control algorithm described in section 3.

In a third experiment, the source at MIT sends 5 flows corresponding to the hierarchical
audio coding rate described in section 2. The compression algorithm selected is ADM4 and
the audio flow is originally at 16 kHz (the output flows correspond to two ADM4 / 8 kHz
audio flows; see Table 1). The receivers implement the congestion control algorithm described in
section 3 to decide how many flows they are allowed to subscribe to.

5 We have noticed that the difference increases with the network load.
                         with shift                    without shift
Receiver  flow #   RTT (ms)  loss (%)  CL (%)    RTT (ms)  loss (%)  CL (%)
INRIA       1        301       6.0      6.0        306       5.9      5.9
            2        301       5.9      1.2        309       7.5      3.6
            3        301       6.3      0.4        313       9.0      3.1
UCL         1        357       4.4      4.4        360       3.9      3.9
            2        358       4.1      0.3        363       4.3      1.2
            3        353       5.0      0.1        367       4.6      0.8
UTS         1        308       0.9      0.9        308       1.1      1.1
            2        307       0.9      0.2        312       1.1      0.4
            3        308       0.7      0.1        312       0.9      0.3

Table 3: RTT, mean and cumulated packet loss intercorrelation between flows




Figure 6: Cumulated loss rate versus sequence number at INRIA

Figure 7: Cumulated loss rate versus sequence number at UCL

Figure 8: Cumulated loss rate versus sequence number at UTS



Figures 9 and 10 show the evolution of the RTT and packet loss rate estimators,
respectively, according to the sequence number (corresponding to the audio sample number)
for both UTS and UCL receivers. The corresponding maximal allowed rate estimate and
the subscription decision are plotted in Figure 11. The results from the local receiver at MIT
have not been plotted but show, as expected, a very low RTT value and a null packet loss
rate, which set the level of flow subscription to its maximal value (i.e. 5).

As expected, we observe that during periods of congestion (with higher RTT and packet
loss rate estimates), the algorithm selects a lower number of flows. When the network is
unloaded, and as soon as the first estimates of the RTT and the packet loss rate are
available, the algorithm can select the highest rate it is allowed to receive (e.g. for the UCL
receiver). The algorithm can also decide to drop several flows at the same time in case of
severe congestion.

In the fourth experiment, we examine how multiple sessions interact while running in
the same network. We have run two different multicast sessions over a low-bandwidth PPP
link* set up between two PCs and used only by these sessions. The two sessions use ADM4
(32 kb/s payload) coding at 8 kHz sent over 3 layers with 80 ms of speech encoded per
packet. Figure 12 shows the subscription decision and the rate received for the two
sessions. We note that both sessions have roughly the same behaviour, i.e. the subscription
decision oscillates between 1 and 2 flows and the mean bandwidth received is about 12 kb/s
for session 1 and 13 kb/s for session 2. When several sessions are run under the same
network conditions (i.e. same packet loss and RTT), we observe that each session gets
roughly the same share of the total network capacity. During tests, we have also noticed
that the PPP link is more likely to lose bursts of packets than the isolated packets
that are commonly lost in the MBone. Since the measure of quality of the coding depends on the packet
loss properties in the network, the following tests have been done on the MBone, even if we
cannot ensure the same network conditions for two consecutive experiments.

* The maximal UDP bandwidth measured over this low-rate link was 83 kb/s using a packet size equal
to the MTU = 1024 bytes.

Figure 11: Subscription level versus sequence number at UCL (left) and UTS (right)

Figure 12: Flows subscribed and Received rate for two different sessions over PPP



4.3 Comparison with "traditional" schemes

It is interesting to compare our scheme with the transmission scheme used in "traditional"
applications such as vat. In the fifth experiment, the source at MIT sends the equivalent
ADM4 compressed 16 kHz data in one flow. This experiment was done a couple of
minutes after the third one, rather than at the same time, to limit our sending rate on the
MBone. Of course, network conditions can be slightly different between the two experiments, so
we cannot simply use the rates observed in both experiments to compare the quality obtained.

Using this layered coding scheme in conjunction with a rate control transmission algorithm
will achieve a variable audio quality according to the network conditions. Further tests
are needed with real users in order to observe their behavior during an audio conference.
During a session the audio quality may sometimes appear "metallic" when only one flow is
received and may sometimes be HiFi (e.g. all flows received corresponding to a 32 kHz /
PCM audio flow). The best way to measure the quality of the audio flows received would
have been to use a MOS measurement [34]. However, there is no such measure available yet
for the experimental PCM and ADM hierarchical encoding schemes described in section 2.
To coarsely quantify the quality obtained at receivers, we use the instantaneous rate received
for each sequence number. The quality of the audio signal increases with the number of
packets received per sample. The instantaneous quality is null when the sample cannot be
reconstructed and is maximal when all the packets constituting the sample are received.
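
The coarse quality measure can be sketched as the sum of the rates of the flows whose packet arrived for a given sample. The reception matrix layout and the per-flow rates below are illustrative assumptions, not values from the experiments.

```python
def received_rate_kbps(received, flow_rates_kbps):
    """Instantaneous received rate for each sample, in kb/s.

    received[f][s] is True when the packet of flow f for sample s arrived;
    flow_rates_kbps[f] is the (hypothetical) payload rate carried by flow f.
    A value of 0 means the sample could not be reconstructed at all.
    """
    n_samples = len(received[0])
    return [
        sum(rate for row, rate in zip(received, flow_rates_kbps) if row[s])
        for s in range(n_samples)
    ]
```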

Figures 13 and 14 show respectively the evolutions of the packet loss rate and the cor- 
responding quality observed at UCL and UTS for the non-hierarchical sending experiment 
(fifth experiment). 

We note that in the fifth experiment, the quality received instantaneously is either minimal
(i.e. sample lost) or maximal (sample totally reconstructed). Some of the receivers (e.g.
UTS) receive lots of "blanks" (e.g. 1 loss corresponds to 80 ms, 5 consecutive losses to 0.4 second), which
sometimes make the audio flow unintelligible. Audio samples corresponding to several experiments
with various packet loss rates can be retrieved at http://www.inria.fr/rodeo/turletti/audio/
so that the quality of our scheme can be compared by ear with that of "traditional" schemes. Moreover,
the output rate sent in the network does not back off and is always set to its maximal value
(i.e. 64 kbps), yielding a "bad citizen" behavior which may contribute to congestion in the
Internet.

Figure 13: Packet loss rate for non-hierarchical flows at UCL (left) and UTS (right)

Figure 14: Received rate for non-hierarchical flows at UCL (left) and UTS (right)

Now let us examine how our layered transmission scheme behaves in the presence of packet
loss. Figures 15 and 17 show respectively the evolution of the packet loss rate and the
corresponding quality observed at UCL and UTS for the third experiment.

In periods of high congestion (when the receiver subscribes to only one flow), the robustness
of this hierarchical coding scheme makes it possible to reconstruct all the packets lost at UCL and
UTS. Clearly, this will not be the case if congestion is more severe still, but in that case the
maximal subscription rate allowed could fall below the minimal sending rate of the application
(e.g. here 21 kb/s of payload), and the receiver should be invited to quit the session and
restart later.
Figure 15: Packet loss rate for hierarchical flows at UCL (left) and UTS (right)

Figure 16: Subscription level for hierarchical flows at UCL (left) and UTS (right)

Figure 17: Received rate for hierarchical flows at UCL (left) and UTS (right)

5 Conclusion

The control scheme presented in the paper has a number of interesting features compared
to RLM. In particular, it has "built-in" TCP friendliness, it does not require coordination
between receivers, and results from an actual implementation show good
convergence properties (in addition to the expected result that the combination of a layered
coding and a layered transmission scheme does help to handle the problem of multicast
delivery over heterogeneous networks). However, there remain a number of problems, most
of them actually related to Equation 1. First, the equation only provides an upper bound
on the rate of the equivalent TCP connection; there is a need to investigate how far the
bound is from the actual value. Second, it is not yet clear exactly how a rate-based control
scheme based on Equation 1 and an actual window-based TCP scheme interact in practice.
Third, the equation relies on parameters such as the mean round trip delay RTT that are
problematic to estimate accurately in large multicast environments. This, combined with
questions about the relevance of TCP-friendliness in networks with RED gateways, means
that the jury is still out on receiver-based control schemes for multicast environments, and
that this remains an important (and somewhat urgent) area for research.
A The receiver-based control pseudo code

int N;
int NbFlow = 0;
int k = 0;
ACTION = JOIN;

// Receive Data and/or Control Packet
void Packet_Received()
{
    if (Rtp_Packet()) {
        SeqNb = Get_Sequence_Number();
        Update_Loss();
        Compute_TCP_Rate();
        Cpt_RTT -= 1;
        if (!Cpt_RTT) {
            // Send a receiver report every N data packets
            Cpt_RTT = N;
            Send_RR_Packet(SeqNb, CurrentTime());
        }
    } else
    if ((Rtcp_Packet()) && (Is_SR_Packet()) &&
        (Packet_Ssrc() == MySSRC)) {
        UpdateRTT();
        Compute_TCP_Rate();
    }
}

// Rate Control Algorithm
void Algo(...)
{
    switch (ACTION) {
    case JOIN:
        if (NbFlow)
            k = 0;
        L = Nb_Of_Layer(TCP_Rate(), Our_Rate());
        Join_Layers(L, &NbFlow);
        Wait(a*d);
        ACTION = NONE;
        break;
    case QUIT:
        L = Nb_Of_Layer(TCP_Rate(), Our_Rate());
        Quit_Layers(L, &NbFlow);
        if (NbFlow >= 1) {
            Wait(a*d);
            ACTION = NONE;
        } else
            ACTION = RETRY;
        break;
    case NONE:
        if (TCP_Rate() > Our_Rate() + NEXT_Layer_Rate())
            ACTION = JOIN;
        else
        if (TCP_Rate() < Our_Rate())
            ACTION = QUIT;
        else {
            ACTION = NONE;
            k = 0;
        }
        break;
    case RETRY:
        // Exponential backoff before probing the base layer again
        tr = power(2, ++k);
        Wait(tr);
        Join_Layers(1, &NbFlow);
        Wait(a*d);
        ACTION = NONE;
        break;
    }
}

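
For readers who want to experiment with the decision logic, the following is a minimal Python sketch of the switch in the pseudo code above. It is an interpretation, not the authors' implementation: the waits are omitted, the layer count is derived from a hypothetical list of cumulative layer rates, and all names are invented for illustration.

```python
from enum import Enum


class Action(Enum):
    JOIN = "join"
    QUIT = "quit"
    NONE = "none"
    RETRY = "retry"


def target_layers(tcp_rate, cumulative_rates):
    """Number of layers whose cumulative rate does not exceed tcp_rate."""
    return sum(1 for rate in cumulative_rates if rate <= tcp_rate)


def step(action, nb_flow, k, tcp_rate, cumulative_rates):
    """Run one pass of the switch; return (action, nb_flow, k, backoff)."""
    backoff = 0
    if action is Action.JOIN:
        if nb_flow > 0:
            k = 0
        nb_flow = max(nb_flow, target_layers(tcp_rate, cumulative_rates))
        action = Action.NONE
    elif action is Action.QUIT:
        nb_flow = min(nb_flow, target_layers(tcp_rate, cumulative_rates))
        action = Action.NONE if nb_flow >= 1 else Action.RETRY
    elif action is Action.RETRY:
        k += 1
        backoff = 2 ** k  # mirrors tr = power(2, ++k)
        nb_flow = 1       # rejoin the base layer only
        action = Action.NONE
    else:  # Action.NONE: compare the TCP-equivalent rate with the current one
        our_rate = cumulative_rates[nb_flow - 1] if nb_flow > 0 else 0.0
        next_rate = (cumulative_rates[nb_flow]
                     if nb_flow < len(cumulative_rates) else float("inf"))
        if tcp_rate > next_rate:
            action = Action.JOIN
        elif tcp_rate < our_rate:
            action = Action.QUIT
        else:
            k = 0
    return action, nb_flow, k, backoff
```

Returning the backoff instead of sleeping keeps the function pure, so a caller can drive it from a simulated clock or a real timer.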
Abstract

In this paper, we analyze a performance model for the 
TCP Congestion Avoidance algorithm. The model pre- 
dicts the bandwidth of a sustained TCP connection sub- 
jected to light to moderate packet losses, such as loss 
caused by network congestion. It assumes that TCP 
avoids retransmission timeouts and always has suffi- 
cient receiver window and sender data. The model pre- 
dicts the Congestion Avoidance performance of nearly 
all TCP implementations under restricted conditions 
and of TCP with Selective Acknowledgements over a 
much wider range of Internet conditions. 

We verify the model through both simulation and 
live Internet measurements. The simulations test sev- 
eral TCP implementations under a range of loss con- 
ditions and in environments with both drop-tail and 
RED queuing. The model is also compared to live In- 
ternet measurements using the TReno diagnostic and 
real TCP implementations. 

We also present several applications of the model to 
problems of bandwidth allocation in the Internet. We 
use the model to analyse networks with multiple con- 
gested gateways; this analysis shows strong agreement 
with prior work in this area. Finally, we present sev- 
eral important implications about the behavior of the 
Internet in the presence of high load from diverse user 
communities.

1 Introduction 

Traffic dynamics in the Internet are heavily influenced
by the behavior of the TCP Congestion Avoidance algorithm
[Jac88a, Ste97]. This paper investigates an
analytical performance model for this algorithm. The
model predicts end-to-end TCP performance from properties
of the underlying IP path. This paper is a first
step at discovering the relationship between end-to-end
application performance, as observed by an Internet
user, and hop-by-hop IP performance, as might be monitored
and marketed by an Internet Service Provider.

*This work is supported in part by National Science Foundation
Grant No. NCR-9415552.



Our initial inspiration for this work was the "heuristic
analysis" by Sally Floyd [Flo91].

This paper follows a first principles derivation of
the stationary distribution of the congestion window of
ideal TCP Congestion Avoidance subject to independent
congestion signals with constant probability. The
derivation, by Teunis Ott, was presented at DIMACS
[OKM96b] and is available on line [OKM96a]. The full
derivation and formal analysis is quite complex and is
expected to appear in a future paper.

We present a simple approximate derivation of the
model, under the assumption that the congestion signal
losses are periodic. This arrives at the same mathematical
form as the full derivation, although the constant
of proportionality is slightly different. This paper is
focused on evaluating the model's applicability and impact
to the Internet.

The model applies whenever TCP's performance is 
determined solely by the Congestion Avoidance algo- 
rithm (described below). We hypothesize that it ap- 
plies to nearly all implementations of SACK TCP (TCP 
with Selective Acknowledgements) [MMFR96] under 
most normal Internet conditions and to Reno TCP
[Jac90, Ste94, Ste97] under more restrictive conditions.
To test our hypothesis we examine the performance 
of the TCP Congestion Avoidance algorithm in three 
ways. First, we look at several TCP implementations 
in a simulator, exploring the performance effects of ran- 
dom packet loss, packet loss due to drop-tail queu- 
ing, phase effects [FJ92], and Random Early Detection 
(RED) queuing [FJ93]. Next, we compare the model to 
Internet measurements using results from the TReno
("tree-no") [Mat96] user mode performance diagnostic.
Finally, we compare the model to measurements
of packet traces of real TCP implementations. 

Many of our experiments are conducted with an up- 
dated version of the FACK TCP [MM96a], designed 
for use with Selective Acknowledgements. We call this 
Forward Acknowledgments with Rate-Halving (FACK- 
RH) [MM96b]. Except as noted, the differences between 
FACK-RH and other TCP implementations do not have 
significant effects on the results. See Appendix A for 
more information about FACK-RH. 
Figure 1: TCP window evolution under periodic loss

Each cycle delivers $(W/2)^2 + \frac{1}{2}(W/2)^2 = 1/p$ packets and takes
W/2 round trip times.


2 The Model 



The TCP Congestion Avoidance algorithm [Jac88a]
drives the steady-state behavior of TCP under conditions
of light to moderate packet losses. It calls for increasing
the congestion window by a constant amount
on each round trip and for decreasing it by a constant
multiplicative factor on each congestion signal (see footnote 1).
Although we assume that congestion is signaled by packet
loss, we do not assume that every packet loss is a new
congestion signal. For all SACK-based TCPs, multiple
losses within one round trip are treated as a single congestion
signal. This complicates our measurements of
congestion signals.

We can easily estimate TCP's performance by making
some gross simplifications. Assume that TCP is
running over a lossy path which has a constant round
trip time (RTT) because it has sufficient bandwidth and
low enough total load that it never sustains any queues.
For ease of derivation, we approximate random packet
loss at constant probability p by assuming that the
link delivers approximately 1/p consecutive packets, followed
by one drop. Under these assumptions the congestion
window (cwnd in most implementations) traverses
a perfectly periodic sawtooth. Let the maximum
value of the window be W packets. Then by the definition
of Congestion Avoidance, we know that during
equilibrium, the minimum window must be W/2 packets.
If the receiver is acknowledging every segment, then
the window opens by one segment per round trip, so
each cycle must be W/2 round trips, or RTT * W/2
seconds. The total data delivered is the area under the
sawtooth, which is $(W/2)^2 + \frac{1}{2}(W/2)^2 = \frac{3}{8}W^2$ packets per
cycle. By assumption, each cycle also delivers 1/p packets
(neglecting the data transmitted during recovery).
Solving for W we get:

$$W = \sqrt{\frac{8}{3p}} \qquad (1)$$

Substitute W into the bandwidth equation below:

$$BW = \frac{\text{data per cycle}}{\text{time per cycle}} = \frac{MSS \cdot \frac{3}{8}W^2}{RTT \cdot \frac{W}{2}} = \frac{MSS/p}{RTT\sqrt{\frac{2}{3p}}} \qquad (2)$$

Collect the constants in one term, $C = \sqrt{3/2}$, then we
arrive at:

$$BW = \frac{MSS}{RTT}\,\frac{C}{\sqrt{p}} \qquad (3)$$

Other forms of this derivation have been published
[Flo91, LM94] and several people have reported unpublished,
"back-of-the-envelope" versions of this calculation
[Mat94a, Cla96].

1 The window is normally opened at the constant rate of one
maximum segment size (MSS) per round trip time (RTT)
and halved on each congestion signal. In actual implementations,
there are a number of important details to this algorithm.

Opening the congestion window at a constant rate is actually
implemented by opening the window by small increments
on each acknowledgment, such that if every segment is acknowledged,
the window is opened by one segment per round trip. Let
W be the window size in packets. Each acknowledgment adjusts
the window: W += 1/W, such that W acknowledgments later
W has increased by 1. Since W equals cwnd/MSS, we have
cwnd += MSS * MSS / cwnd, which is how the window opening
phase of congestion avoidance appears in the code.

When the congestion window is halved on a congestion signal, it
is normally rounded down to an integral number of segments. In
most implementations the window is never adjusted below some
floor, typically 2 segments. Both derivations neglect rounding
and this low window limit. [Flo91] considers rounding, resulting
in a small correction term.

We are also neglecting the details of TCP data recovery and
retransmission. Some form of Fast Retransmit and/or Fast Recovery,
with or without SACK, is required. The important detail
is that the loss recovery is completed in roughly one round trip
time, TCP's Self-clock is preserved, and that the new congestion
window is half of the old congestion window.

ACM SIGCOMM 68



Derivation                        ACK Strategy    C
Periodic Loss (derived above)     Every Packet    1.22 = $\sqrt{3/2}$
                                  Delayed         0.87 = $\sqrt{3/4}$
Random Loss (follows [OKM96a])    Every Packet    1.31
                                  Delayed         0.93

Table 1: Derived values of C under different assumptions.
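
The periodic-loss derivation and the corresponding entries of Table 1 can be checked numerically. The sketch below is only an illustration: the function and constant names are invented here, and units are whatever the caller uses for MSS (e.g. bytes) and RTT (e.g. seconds). The delayed-ACK constant applies the factor of $\sqrt{2}$ described in the paper's footnote on Delayed Acknowledgments.

```python
import math

C_PERIODIC_EVERY_ACK = math.sqrt(3.0 / 2.0)                      # about 1.22
C_PERIODIC_DELAYED_ACK = C_PERIODIC_EVERY_ACK / math.sqrt(2.0)   # about 0.87


def max_window(p):
    """Peak window W in packets from Equation 1."""
    return math.sqrt(8.0 / (3.0 * p))


def bandwidth_from_cycle(mss, rtt, p):
    """Equation 2: data per cycle divided by time per cycle."""
    w = max_window(p)
    data_per_cycle = mss * (3.0 / 8.0) * w * w
    time_per_cycle = rtt * w / 2.0
    return data_per_cycle / time_per_cycle


def bandwidth_model(mss, rtt, p, c=C_PERIODIC_EVERY_ACK):
    """Equation 3: BW = (MSS / RTT) * C / sqrt(p)."""
    return (mss / rtt) * c / math.sqrt(p)
```

Both functions return the same value for any valid p, which confirms that Equation 3 is just Equation 2 with the constants collected into C.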



The constant of proportionality (C) lumps together
several terms that are typically constant for a
given combination of TCP implementation, ACK strategy
(delayed vs non-delayed; see footnote 2), and loss mechanism.
Included in the TCP implementation's contribution to C
are the constants used in the Congestion Avoidance algorithm
itself.

The model is not expected to apply under a number
of situations where pure Congestion Avoidance does not
fully control TCP performance. In general these phenomena
reduce the performance relative to that which
is predicted by the model. Some of these situations are:

1. If the data receiver is announcing too small a window,
then TCP's performance is likely to be fully
controlled by the receiver's window and not at all
by the Congestion Avoidance algorithm.

2. Likewise, if the sender does not always have data
to send, the model is not likely to apply.

3. The elapsed time consumed by TCP timeouts is
not modeled. Many non-SACK TCP implementations
suffer from timeouts when they experience
multiple packet losses within one round trip time
[Flo95, MM96a]. These TCP implementations do
not fit the model in environments where they experience
such losses.

4. TCP implementations which exhibit go-back-N
behaviors do not attain the performance projected
by the model because the model does not account
for the window consumed by needlessly retransmitting
data. Although we have not studied these
situations extensively, we believe that Slow-start,
either following a timeout or as part of a normal
Tahoe recovery, has at least partially go-back-N
behavior, particularly when the average window is
small.

5. TCP implementations which use other window
opening strategies (e.g. TCP Vegas [BOP94,
DLY95]) will not fit the model.

6. In some situations, TCP may require multiple
cycles of the Congestion Avoidance algorithm to
reach steady-state (see footnote 3). As a result, short connections
do not fit the model.

Except for Item 6, all of these situations reduce
TCP's average throughput. Under many circumstances
it will be useful to view Equation 3 as a bound on performance.
Given that Delayed Acknowledgements are
mandatory, C is normally less than 1. Thus in many
practical situations, we can use a simpler bound:

$$BW < \frac{MSS}{RTT}\,\frac{1}{\sqrt{p}} \qquad (4)$$

We will show that it is important that appropriate
measurements be used for p and RTT. For example
SACK TCP will typically treat multiple packet losses in
one RTT as a single congestion signal. For this case, the
proper definition for p is the number of congestion
signals per acknowledged packet.

2 The Delayed Acknowledgment ("DA") algorithm [Ste94] suppresses
half of the TCP acknowledgments to reduce the number
of tiny messages in the Internet. This changes the Congestion
Avoidance algorithm because the window increase is driven by
the returning acknowledgments. The net effect is that when the
TCP receiver sends Delayed Acknowledgments, the sender only
opens the window by MSS/2 on each round trip. This term can
be carried through any of the derivations and always reduces C
by $\sqrt{2}$.

The receiver always suppresses Delayed Acknowledgements
when it holds partial data. During recovery the receiver acknowledges
every incoming segment. The receiver also suppresses Delayed
Acknowledgements (or more precisely, transmits acknowledgements
on a timer) when the data packets arrive more than
200 ms apart.

There are also a number of TCP implementations which have
bugs in their Delayed Acknowledgment algorithms such that they
send acknowledgments less frequently than 1 per 2 data segments.
These bugs further reduce C.

3 This problem is discussed in Appendix B. All of the simulations
in this paper are sufficiently long such that they unambiguously
reach equilibrium.

Although these derivations are for a rather restricted
setting, our empirical results suggest that the model is
more widely applicable.
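
The definition of p as congestion signals per acknowledged packet can be illustrated with a small sketch. The grouping rule used here (a loss starts a new signal only if it falls at least one RTT after the loss that opened the previous signal) is a simplifying assumption for illustration; a real SACK TCP delimits signals by its recovery episodes. All names are hypothetical.

```python
def congestion_signals(loss_times, rtt):
    """Count congestion signals, merging losses that fall within one RTT."""
    signals = 0
    window_end = None
    for t in sorted(loss_times):
        if window_end is None or t >= window_end:
            signals += 1          # this loss opens a new congestion signal
            window_end = t + rtt  # later losses before this time are merged
    return signals


def signal_rate(loss_times, rtt, acked_packets):
    """Estimate p as congestion signals per acknowledged packet."""
    return congestion_signals(loss_times, rtt) / acked_packets
```

Counting raw packet losses instead would overestimate p whenever losses cluster, and so would make the model underestimate the achievable bandwidth.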

3 Simulation 

All of our simulations use the LBL simulator, "ns version 1",
which can be obtained via FTP [MF95].

Most of the simulations in this paper were conducted
using the topology in Figure 2. The simulator associates
queuing properties (drop mechanism, queue size, etc.)
with links. The nodes (represented by circles) implement
TCP, and do not themselves model queues. We
were careful to run the simulations for sufficient time to
obtain good measures of TCP's average performance (see footnote 4).

This single link is far too simple to model the complexity
of a real path through the Internet. However,
by manipulating the parameters (delay, BW, loss rate)
and queuing models (drop-tail, RED) we will explore
the properties of the performance model.




Figure 2: The simulation topologies 



3.1 Queueless Random Packet Loss 

In our first set of experiments, the single link in Figure 2
was configured to model the conditions under
which Equation 3 was derived in [OKM96a]: constant
delay and fixed random packet loss. These conditions
were represented by a lossy, high bandwidth link (see footnote 5) which
does not sustain a queue.

The choice of our FACK-RH TCP implementation 
does not affect the results in this section, except that 
it is able to remain in Congestion Avoidance at higher 
loss rates than other TCPs. This phenomenon will be
discussed in detail in Section 3.4. The receiver is using 
standard Delayed Acknowledgements. 

The network was simulated for various combinations
of delay, MSS, and packet loss. The simulation used
three typical values for MSS: 536, 1460, and 4312
bytes. The one-way delay spanned five values from 3 ms
to 300 ms; and the probability of packet loss was randomly
selected across four orders of magnitude, spanning
roughly from 0.00003 to 0.3 (uniformly distributed
in log(p)). Since each loss was independent (and assumed
to be relatively widely spaced), each loss was
considered to be a congestion signal.

4 This was done by using the bandwidth-delay product to estimate
an appropriate duration for the simulation, such that Congestion
Avoidance experienced 50 or more cycles. The duration
was confirmed from instruments in the simulator.

5 In order to make sure that the link bandwidth was not a
limiting factor, the link bandwidth selected was more than 10
times the estimated bandwidth required. We then confirmed that
the link did not sustain a queue.
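
Selecting the loss probability uniformly in log(p) can be sketched as below. This is not the simulator's code; the function name and the use of Python's `random` module are assumptions made for illustration, with the bounds taken from the range quoted in the text.

```python
import math
import random


def sample_loss_probability(rng, low=0.00003, high=0.3):
    """Draw p so that log10(p) is uniform between log10(low) and log10(high)."""
    exponent = rng.uniform(math.log10(low), math.log10(high))
    return 10.0 ** exponent
```

Sampling in the log domain gives each decade of loss rate roughly the same number of points, which a plain uniform draw over [low, high] would not.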

In Figure 3 we assume C = 1 and plot the simulation
vs. the model. Each point represents one combination
of RTT, MSS, and p in the simulation. The X axis
represents the bandwidth estimated by the model from
these measurements, while the Y axis represents the
bandwidth as measured by the simulation. Note that
the bandwidth, spanning nearly five orders of magnitude,
has a fairly strong fit along one edge of the data.
However, there are many outlying points where the simulation
does not attain the predicted bandwidth.

In Figure 4 we plot a different view of the same data
to better illuminate the underlying behaviors. Simulations
that experienced timeouts are indicated with open
markers. For the remainder of the figures (except where
noted), we rescale the Y axis by RTT/MSS. The Y
axis is then BW * RTT/MSS which, from classical protocol
theory, is a performance-based estimate of the average
window size (see footnote 6).

We plot p on the X axis, with the loss rate increasing 
to the right. 

To provide a common reference for comparing data 
between experiments, we plot the line corresponding 
to the model with C = 1 in Figure 4 and subsequent 
figures. 

When p < 0.01 (the left side of the graph) the fit be- 
tween the model and the simulation data is quite plau- 
sible. By looking at the data at p = 0.0001 we estimate 
C to be 0.9, which agrees with the Delayed ACK en- 
tries in Table 1. Notice that the simulation and model 
have slightly different slopes, which we will investigate 
in Section 3.5. 

When the average loss rate (p) is large (the right side
of the graph), the loss of multiple packets per RTT
becomes likely. If too many packets are lost, TCP will lose
its Self-clock and be forced to rely on a retransmission
timeout, followed by a Slow-start to recover. As
mentioned above, timeouts are known not to fit the model.
Note that the open markers indicate if there were any
timeouts in the simulation for a given data point. Many
of the open markers near p = 0.01 experienced only a
few timeouts, such that the dominant behavior was still
Congestion Avoidance, and the model more or less fits.
By the time the loss rate gets to p = 0.1 the timeout
behavior becomes significant. Our choice of FACK-RH
TCP alters the transition between these behaviors. We
compare different TCP implementations in Section 3.4.

Note that $C/\sqrt{p}$ can be viewed as the model's
estimate of window size. This makes sense because packet
losses and acknowledgment arrivals drive window ad- 
justments. Although time scale and packet size do de- 
termine the total bandwidth, they only indirectly affect 
the window through congestion signals. 
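
The relationship between the window estimate and the bandwidth can be illustrated with a minimal Python sketch. The function names and the sample values of MSS, RTT and p are assumptions chosen for illustration; only the form of the model (window of C over the square root of p, scaled by MSS/RTT) comes from the text.

```python
import math

def model_window(p, c=1.0):
    """Model's estimate of the average window, in packets: C / sqrt(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return c / math.sqrt(p)

def model_bandwidth(mss_bytes, rtt_s, p, c=1.0):
    """Bandwidth in bytes per second: (MSS / RTT) * C / sqrt(p)."""
    return mss_bytes / rtt_s * model_window(p, c)

# Illustrative inputs (not taken from the paper's data set).
w = model_window(0.0001)                   # window depends only on C and p
bw = model_bandwidth(1460, 0.06, 0.0001)   # MSS and RTT only rescale that window
print(f"window = {w:.1f} packets, bandwidth = {bw:.0f} bytes/s")
```

Note that changing MSS or RTT in this sketch changes the bandwidth but leaves the window untouched, which is the point made in the paragraph above.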

6 In this paper, "window" always means "window in packets",
and not "window in bytes."
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Figure 3: The Measured vs. Estimated BW 
The simulation used three typical values for MSS: 536, 1460,
and 4312 bytes. The one-way delay spanned from 3 ms to 300
ms; and the probability of packet loss was randomly selected
between 0.00003 and 0.3. In the model we have assumed C
to be 1. 
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Figure 4: Window vs. Loss 
This is a different view of the same data as in Figure 3. Each
point has been rescaled in both axes.

3.2 Environments with Queuing

Since, under the assumptions of Section 2, the Congestion
Avoidance algorithm is only sensitive to packet loss
and Acknowledgement arrivals, we expect the model to
continue to correctly predict the window when queuing
delays are experienced. Thus, with an appropriate
definition for RTT, the model should hold for environments
with queuing.

We performed a set of simulations using bottlenecked
links where queuing could take place. We used a
drop-tail link (Figure 2 with drop-tail) with RTT = 60 ms
and MSS = 1024 bytes. The link bandwidth was varied
from 10 kb/s to 10 Mb/s, while the queue size was varied
from 5 to 30 packets. Therefore, the ratio of
delay-bandwidth product to queue length spanned from 15:1
to 1:400. The simulations in Figures 5 and 6 were all
performed with the stock Reno module in the simulator.
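
The quoted range of ratios can be checked with a short Python calculation. This is a sketch that assumes 1 kb/s means 1000 bits per second and that the delay-bandwidth product is expressed in MSS-sized packets; the function name is illustrative.

```python
def bdp_packets(link_bps, rtt_s, mss_bytes):
    """Delay-bandwidth product expressed in MSS-sized packets."""
    return link_bps * rtt_s / 8.0 / mss_bytes

RTT_S = 0.060       # 60 ms
MSS_BYTES = 1024

# Largest ratio: fastest link (10 Mb/s) with the shortest queue (5 packets).
high = bdp_packets(10e6, RTT_S, MSS_BYTES) / 5
# Smallest ratio: slowest link (10 kb/s) with the longest queue (30 packets).
low = bdp_packets(10e3, RTT_S, MSS_BYTES) / 30

print(f"{high:.1f}:1")      # 14.6:1, i.e. roughly 15:1
print(f"1:{1 / low:.0f}")   # 1:410, i.e. roughly 1:400
```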

In Figure 5 we plot the data using the fixed part of
the RTT, which includes only propagation delay and
copy time. Clearly the fit is poor.

In Figure 6 we re-plot the same data, but use the 
RTT as measured by a MIB-like instrument in the sim- 
ulated TCP itself. The instrument uses the round trip 
time as measured by the RTTM algorithm [JBB92] to
compute the round trip time averaged across the entire 
connection. This is the average RTT as sampled by the 
connection itself. 

This transformation has the effect of making the Y
axis a measurement-based estimate of the average window.
It moves individual points up (relative to the upper
graph, Figure 5) to reflect the queuing delay.
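
A small Python sketch of the rescaling may help. The throughput and the measured RTT below are hypothetical values, not results from the simulations; the sketch only shows that, for the same measured bandwidth, using the connection's own average RTT yields a larger window estimate than using the fixed 60 ms RTT.

```python
def window_estimate(bw_bytes_per_s, rtt_s, mss_bytes):
    """Performance-based estimate of the average window, in packets."""
    return bw_bytes_per_s * rtt_s / mss_bytes

BW = 125_000.0   # hypothetical measured throughput, bytes per second
MSS = 1024

w_fixed = window_estimate(BW, 0.060, MSS)      # propagation and copy time only
w_measured = window_estimate(BW, 0.150, MSS)   # hypothetical average RTT with queuing

# The point moves up by the ratio of the two RTTs.
print(w_fixed, w_measured, w_measured / w_fixed)
```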

The slope of the data does not quite agree with the
model, and there are four clusters of outliers. The slope
(which we will investigate in Section 3.5) is the same as
in Figure 4.

The four clusters of outliers are due to the long
packet times at the bottleneck link causing the Delayed
ACK timer to expire. This effectively inhibits the
Delayed ACK algorithm such that every data packet
causes an ACK, raising C by a factor of $\sqrt{2}$ for the
affected points, which lie on a line parallel to the rest of
the data.
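
The factor of $\sqrt{2}$ can be reproduced numerically. The Python sketch below assumes an idealized periodic sawtooth (window climbs from half its peak to the peak, then one loss halves it), with the window opening by one packet per round trip when every packet is acknowledged and by one packet every two round trips with Delayed ACKs. The function name and the peak window of 1000 packets are illustrative choices.

```python
import math

def sawtooth_constant(peak_window, rtts_per_increment):
    """Estimate C = (average window) * sqrt(p) for an idealized periodic sawtooth.

    The window climbs from peak/2 to peak by one packet every
    `rtts_per_increment` round trips, then a single loss halves it.
    """
    half = peak_window // 2
    packets = 0
    rtts = 0
    for w in range(half, peak_window):
        packets += w * rtts_per_increment
        rtts += rtts_per_increment
    p = 1.0 / packets             # one congestion signal per cycle
    avg_window = packets / rtts   # packets delivered per round trip
    return avg_window * math.sqrt(p)

c_every_packet = sawtooth_constant(1000, 1)   # an ACK for every data packet
c_delayed = sawtooth_constant(1000, 2)        # Delayed ACKs: window opens half as fast
ratio = c_every_packet / c_delayed
print(c_every_packet, c_delayed, ratio)
```

Under these assumptions the two constants come out close to $\sqrt{3/2}$ and $\sqrt{3/4}$, and their ratio is $\sqrt{2}$ regardless of the peak window chosen.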

We conclude that it is necessary to use an RTT mea- 
surement that is appropriate for the connection. The 
RTT as sampled by the connection itself is always ap- 
propriate. Under some circumstances it may be possible 
to use other simpler measures of RTT, such as the time
average of the queue at the bottleneck. 

Reno fits the model under these conditions because 
the idealized topology in Figure 2 drops exactly one 
packet at the onset of congestion 7 , and Reno's Fast 

7 It has been observed that Reno TCP's Self-clock is fragile in
the presence of multiple lost packets within one round trip [Hoe95,
Flo95, Hoe96, FF96, MM96a, LM94]. In the simulator, a single
TCP connection in ongoing Congestion Avoidance nearly always
causes the queue at the bottleneck to drop exactly one
packet when it fills. This is because the TCP opens the window
very gradually, and there is no cross traffic or ACK compression
to introduce jitter. Under these conditions Reno avoids any of its
problems with closely spaced losses.




Figure 5: Estimated Window vs. Loss. 
Simulations of Reno over a bottlenecked link with a drop-tail
queue, without correcting for queuing delay. The RTT was 60
ms and the MSS was 1 kbyte. The bandwidth was varied from
10 kb/s to 10 Mb/s, and the queue size was varied from 5 to 30
packets.
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Figure 6: Estimated Window vs. Loss. 

This is a different view of the same data as Figure 5,
transformed by using TCP's measure of the RTT.
Recovery and Fast Retransmit algorithms are sufficient
to preserve the Self-clock. Under these conditions Reno
exhibits idealized Congestion Avoidance and fits the
model. If the simulations are re-run using other TCP
implementations with standard Congestion Avoidance
algorithms 8, the resulting data is nearly identical to
Figures 5 and 6. For NewReno, SACK and FACK the data
points agree within the quantization errors present in
the simulation instruments. This is expected, because
all are either derived from the original Reno code, or
were expressly designed to have the same overall
behavior as Reno when subjected to isolated losses.

3.3 Phase Effects 

Phase effects [FJ92] are phenomena in which a small
change in path delay (on the order of a few packet times)
has a profound effect on the observed TCP performance.
They arise because packets leaving the bottlenecked link
induce correlation between the packet arrival and the
freeing of queue space at the same bottleneck. In this
section we will show why phase effects are consistent
with the model, and what this implies about the future
of Internet performance instrumentation.




Figure 7: Phase Effects topology. 

8 The simulator includes models for several different TCP
implementations. Tahoe [Jac88a] and Reno (described in [Ste97] and
[Jac90]) are well known. The simulator also includes a SACK
implementation "SACK1" [Flo96], which was based on the original
[JB88] SACK, but has been updated to RFC 2018 [MMFR96].
This is, by design, a fairly straightforward implementation of
SACK TCP using Reno-style congestion control. NewReno is
a version of Reno that has some modifications to correct what is
essentially a bug that frequently causes needless timeouts in
response to multiple-packet congestion signals. This modification
was first suggested by Janey Hoe [Hoe95, CH95] and has been
thoroughly analyzed [Flo95, FF96].

Tahoe TCP has significantly different steady-state behavior
than newer TCP implementations. Whenever a loss is
detected the congestion window is reduced to 1 (without changing
ssthresh). This causes a Slow-start (taking roughly $\log_2 W$ round
trips, and delivering roughly W segments). Tahoe does not fit
the model in a badly underbuffered network (due to persistent
repeated timeouts). At higher loss rates, when a larger fraction
of the overall time is spent in Slow-start, Tahoe has a slightly
different shape, and therefore the model is less accurate.



In Figure 7 we have reconstructed 9 one of the
simulations from [FJ92], using two SACK TCP connections
through a single bottlenecked link with a drop-tail
router. Rather than reconstructing the complete
simulation in which the variable delay is adjusted across
many closely spaced values, we present a detailed
analysis of one operating point, δ = 9.9 ms. In this case,
the packets from connection 2 are most likely to arrive
just after queue space has been freed, but just before
packets from connection 1. Since the packets from
connection 2 have a significant advantage when competing
for the last packet slot in the queue, connection 1 sees
more packet drops.

The packets are 1 kbyte long, so they arrive at the
receiver (node Kl) every 10 ms. The 15 packet queue
at link L slightly underbuffers the network. We added
instrumentation to both the bottlenecked link and the
TCP implementations, shown in Table 2. The Link
column presents the link instruments on L, including
total link bandwidth and the time average of the queue
length, expressed as the average queuing delay. The two
TCP columns present our MIB-like TCP instruments
for the two TCP connections, except for the RTT
Estimate row, which is the average RTT computed by
adding the average queue length of the link to the
minimum RTT of the entire path.

The loss instruments in the TCP implementation
categorize each loss depending on how it affected the
congestion window. Losses that trigger successful
(clock-preserving) divide-by-two window adjustments
are counted as "CA events". All other downward
window adjustments (i.e. timeouts) are counted as
"non-CA events". Additional losses which are detected while
already in recovery do not cause their own window
adjustments and are counted as "other losses" 10.
In the drop-tail case (on the left side of the table), we
can see that TCP1 experienced 103 CA events and 37
non-CA events (timeouts). During those same recovery
intervals, there were an additional 76 losses which were
not counted as congestion signals. Note that p is the
number of CA events per acknowledged segment. The
link loss instruments, by contrast, do not categorize lost
packets, and cannot distinguish losses triggering
congestion avoidance.
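
The bookkeeping described above can be sketched in Python. The record layout (`halved_with_clock`, `extra_losses`), the function names and the sample episodes are all hypothetical; the sketch only shows how the three counters relate and how p is formed from CA events alone.

```python
from collections import Counter

def classify(episode):
    """Classify one loss episode by its effect on the congestion window.

    `episode` is a hypothetical record with two fields:
      halved_with_clock: the window was halved and the Self-clock survived
      extra_losses:      further losses detected while already in recovery
    """
    kind = "CA" if episode["halved_with_clock"] else "non-CA"
    return kind, episode["extra_losses"]

def tally(episodes, acked_segments):
    """Count CA events, non-CA events and other losses; return (counts, p)."""
    counts = Counter()
    for ep in episodes:
        kind, extra = classify(ep)
        counts[kind] += 1          # exactly one event per loss episode
        counts["other"] += extra   # not counted as congestion signals
    p = counts["CA"] / acked_segments
    return counts, p

# Made-up episodes for illustration only.
episodes = [
    {"halved_with_clock": True, "extra_losses": 0},
    {"halved_with_clock": True, "extra_losses": 2},
    {"halved_with_clock": False, "extra_losses": 1},   # e.g. a timeout
]
counts, p = tally(episodes, acked_segments=1000)
print(dict(counts), p)
```

A link-level instrument sees only the total of six lost packets in this example, which is why it cannot recover the same p.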

The TCP RTT instrument is the same as in the
previous section (i.e. based on the RTTM algorithm).

Note that even though the delay is different by only 
4.9 ms there is about a factor of 4 difference in the per- 
formance. This is because the loss rate experienced by 

9 Our simulation is identical, except that we raised the
receiver's window such that it does not interfere with the
Congestion Avoidance algorithm. This alters the overall behavior
somewhat because the dominant connection can potentially
capture the full bandwidth of the link.

10 Every loss episode counts as exactly one CA or non-CA event.
Episodes in which there was a Fast Retransmit, but Fast Recovery
was unsuccessful at preserving the Self-clock or additional losses
caused additional window reductions were counted as non-CA
events.

All additional retransmissions (occurring in association with
either a timeout or congestion signal) are counted as additional
lost packets.
Table 2: Phase effects with queue limit = 15, δ = 9.9 ms
Table 3: Phase effects with queue limit = 100, δ = 9.9 ms
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34750 


packets 




49845 
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TCP RTT 
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396 (1%) 
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TCP Model 


kb/s 




404 (1%) 
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each connection is different by an order of magnitude. 

The model is used to predict performance in two
different ways. The first technique, the Link Model,
uses only the link instruments, while the second, the
TCP Model, uses only the TCP instruments. Clearly
applying the model to the aggregate link statistics or to
TCP1 statistics (with 37 timeouts) in the drop-tail case
cannot yield accurate results. The model 11 applied
to TCP2's internal instruments correctly predicts the
bandwidth.

Random Early Detection (RED) [FJ93] is a form
of Active Queue Management [B+97], which manages
the queue length in a router by strategically discarding 
packets before queues actually fill. Among many gains, 
this permits the router to randomize the packet losses 
across all connections, because it can choose to drop 
packets independent of the instantaneous queue length, 
and before it is compelled to drop packets by buffer 
exhaustion. 

In the phase effects paper [FJ92], it is observed that
if a router uses RED instead of drop-tail queuing, the
phase effects disappear. In the right side of Table 2 we
present a simulation which is identical to the left side,
but using RED at the bottleneck (link L). With RED,
the link instruments are in nearer agreement with the
TCP instruments; so the model gives fairly consistent
results when calculated from either link statistics or

11 Since we are most interested in the drop-tail simulation near
p ~ 0.01, we estimated a locally-accurate value of C = 0.8 by
examining the data used in Figure 6 in the previous section. This
value of C was used for all the model bandwidths shown in
Tables 2 and 3.



TCP instruments 12. The residual differences between
the results predicted by the model are due to p not
being precisely uniform between the two TCP connec- 
tions. This may reflect some residual bias in RED, and 
bears further investigation. 

In Table 3 we repeated the simulations from Table 2, 
but increased the packet queue limit at link L to 100. As 
you would expect, this only slightly changes the RED 
case. However, there are several interesting changes to 
the drop tail case. 

Average RTT has risen to a full second. Without 
RED to regulate the queue length, SACK TCP only 
halves its window when it fills the 100 packet queue. 
TCP's window is being regulated against a full queue, 
rather than some other operating point closer to the 
onset of queued data. Even if both connections expe- 
rience a packet loss in the same RTT, the queue will 
not fully drain. We can gauge the queue sizes from the
average link delay instrument: 800 ms corresponds to
80 packets. We know that the peak is 100 packets, so
the minimum queue is likely to be near 60 packets, or
600 ms! This is not likely to please interactive users.
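
The queue arithmetic in this paragraph can be written out as a short Python sketch. It assumes, as the text implies, that each 1 kbyte packet occupies the link for 10 ms, and additionally assumes that the queue swings roughly symmetrically about its time average; that symmetry is an assumption made here to reproduce the estimate of 60 packets.

```python
PACKET_TIME_MS = 10   # one 1 kbyte packet leaves link L every 10 ms
PEAK_QUEUE = 100      # queue limit in packets

def delay_to_packets(delay_ms):
    """Convert an average queuing delay into an average queue length."""
    return delay_ms / PACKET_TIME_MS

avg_queue = delay_to_packets(800)           # 80 packets
# Assumption: the queue oscillates symmetrically about its time average.
min_queue = 2 * avg_queue - PEAK_QUEUE      # 60 packets
min_delay_ms = min_queue * PACKET_TIME_MS   # 600 ms
print(avg_queue, min_queue, min_delay_ms)
```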

The tripling of the RTT requires an order of mag- 
nitude lower loss to sustain (roughly) constant band- 
width. The model is less accurate, even when applied 
to the TCP instruments because the loss sample size is 
too small, causing a large uncertainty in p. Excluding 
the initial Slow-start, TCP2 only experienced 5 losses 

12 Note that RED also lowered the average link delay, lowered
the total packet losses, and raised the aggregate throughput. RED
usually signals congestion with isolated losses. Therefore Reno
might operate as well as SACK TCP in this environment.
during the 500 second measurement interval (i.e. each
Congestion Avoidance cycle took 100 seconds!)
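
Both numbers in this paragraph follow from simple arithmetic, sketched below in Python. The function name is illustrative, and the calculation assumes the model in the form BW = C * (MSS/RTT) * p^k with k = -1/2 and with BW, C and MSS held constant.

```python
def loss_scale_for_constant_bandwidth(rtt_ratio, k=-0.5):
    """Factor by which p must change to hold BW fixed when RTT scales by rtt_ratio.

    From BW = C * (MSS / RTT) * p**k with BW, C and MSS held constant,
    p_new / p_old = rtt_ratio ** (1 / k).
    """
    return rtt_ratio ** (1.0 / k)

scale = loss_scale_for_constant_bandwidth(3.0)   # RTT tripled
cycle_s = 500 / 5                                # 5 losses in 500 seconds

print(f"p must fall to {scale:.3f} of its old value (about 1/9)")
print(f"one Congestion Avoidance cycle per {cycle_s:.0f} seconds")
```

A ninefold reduction in p is the "order of magnitude" referred to above.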

The symptoms of overbuffering without RED are:
long queuing delays and very long convergence time for 
the congestion control algorithm. 

Also note that our opening problem of projecting
end-to-end TCP performance from hop-by-hop path 
properties requires reasonable assurance that the link 
statistics collected at any one hop are indicative of 
that hop's contribution to the end-to-end path statis- 
tics. This requirement is not met with drop-tail routers, 
where correlation in the traffic causes correlation in the 
drops. 

If packet losses are not randomized at each bottle- 
neck, then hop-by-hop performance metrics may not 
have any bearing upon end-to-end performance. RED 
(or possibly some other form of Active Queue Manage- 
ment) is required for estimating end-to-end performance 
from link statistics. Conversely, if a provider wishes to 
assure end-to-end path performance, then all routers 
(and other potential bottlenecks) must randomize their
losses across all connections common to a given queue 
or bottleneck. 

Also note that if the losses are randomized, $C/\sqrt{p}$ is
a bound on the window size for all connections through
any bottleneck or sequence of bottlenecks. Furthermore,
connections which share the same (randomized
loss) bottleneck tend to equalize their windows [CJ89].
We suspect that this is the implicit resource allocation 
principle already in effect in the Internet today. 13 

3.4 Effect of TCP Implementation 




Figure 8: Algorithm dominance vs packet loss. 
The fraction of all downward window adjustments 
which are successful (clock-preserving) divide-by-two 
window adjustments. 



13 Note that our observation is independent of the model in
this paper. To the extent that the window is only determined
by losses (which isn't quite true) and that losses are equalized at
bottlenecks (which also isn't true without RED, etc.), the Internet
must tend to equalize windows.



We wish to compare how well different TCP implemen- 
tations fit the model by investigating two aspects of 
their behavior. We first investigate the transition from 
Congestion Avoidance behavior at moderate p to time- 
out driven behavior at larger p. In the next section, we 
investigate a least squares fit to the model itself. 

As mentioned earlier, the model does not predict the 
performance when TCP is timeout driven. Although in 
our simulations timeouts do not cause a serious
performance penalty, we have not included cross traffic or
other effects that might raise the variance of the RTT 
and thus raise the calculated retransmission timeout. 
Although Figure 4 might seem to imply that the model 
fits timeouts, remember that this was in a queueless
environment, where there is zero variance in the RTT. 
Under more realistic conditions it is likely that timeouts 
will significantly reduce the performance relative to the 
model's prediction. 

We simulated all of the TCP implementations
supported by the simulator 8, with and without Delayed
ACK receivers, and instrumented the simulator to
tabulate all downward window adjustments into the same
two categories as used in the previous section. The first,
"CA events," includes all successful (clock-preserving)
divide-by-two window adjustments. The second,
"non-CA events", includes all other downward window
adjustments. In Figure 8 we plot the proportion of all
downward adjustments which were successful
invocations of the Congestion Avoidance algorithm. (This
data is also summarized on the right side of Table 4.)

FACK-RH TCP avoids timeouts under more severe
loss than the other TCP implementations because
it normally sends one segment of new data after the
first duplicate acknowledgment but before reaching the
dupack threshold (triggering the first retransmission;
see Appendix A). All of the other TCPs are unable
to recover from a single loss unless the window is at
least 5 packets 14. The horizontal position of the steep
downward transition reflects the loss rate at which the
various TCPs no longer retain sufficient average window
for Fast Retransmit. Under random loss SACK, Reno,
NewReno, and Tahoe all have essentially the same
characteristics.
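
The 5 packet threshold can be translated into a loss rate by inverting the model's window estimate. The Python sketch below is illustrative only: the function name is an assumption, and it uses the C = 1 reference line together with the C = 0.9 value estimated earlier for Delayed ACKs.

```python
def loss_rate_for_window(window_packets, c=1.0):
    """Invert W = C / sqrt(p) to get the loss rate giving an average window W."""
    return (c / window_packets) ** 2

p_ref = loss_rate_for_window(5)          # C = 1 reference line
p_da = loss_rate_for_window(5, c=0.9)    # C = 0.9, Delayed ACK estimate
print(f"{p_ref:.3f} {p_da:.4f}")
```

Under the model, an average window of 5 packets therefore corresponds to p somewhere around 0.03 to 0.04.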

3.5 Fitting the slope

As we have observed in the previous sections, the win- 
dow vs. loss data falls on a fairly straight line on a log- 
log plot, but the slope is not quite -1/2. This suggests 
that a better model might be in the following form: 



$$BW = C \, \frac{MSS}{RTT} \, p^{k} \qquad (5)$$



Where k is roughly -1/2. 

We performed a least mean squared fit between
Equation 5 and the TCP performance as measured in
the simulator. The results are shown in Table 4. All

14 One packet is lost, the next three cause duplicate
acknowledgements, which are only counted. The lost packet is not
retransmitted until the fifth packet is acknowledged.
                                       Least Mean Squares fit                   Proportion of successful
Acknowledgement  TCP                   Equation 3      Equation 5               W/2 adjustments
Scheme           Implementation   N    C               k        C               p = 0.01  p = 0.033  p = 0.1

No Delayed ACKs  FACK             16   1.352 ± 0.090   -0.513   1.205 ± 0.047   0.996     0.985      0.738
                 SACK             11   1.346 ± 0.052   -0.508   1.247 ± 0.033   0.992     0.822      0.497
                 Reno             12   1.331 ± 0.054   -0.521   1.096 ± 0.009   0.935     0.765      0.331
                 New Reno         12   1.357 ± 0.055   -0.516   1.167 ± 0.020   0.983     0.896      0.517
                 Tahoe            11   1.254 ± 0.079   -0.534   0.920 ± 0.015   0.974     0.796      0.367

Delayed ACKs     FACK DA          15   0.928 ± 0.086   -0.519   0.783 ± 0.045   1.000     0.929      0.725
                 SACK DA          10   0.938 ± 0.036   -0.518   0.792 ± 0.012   0.952     0.664      0.112
                 Reno DA          10   0.939 ± 0.046   -0.524   0.752 ± 0.015   0.919     0.595      0.157
                 New Reno DA      11   0.935 ± 0.045   -0.526   0.738 ± 0.006   0.942     0.635      0.176
                 Tahoe DA         11   0.883 ± 0.076   -0.542   0.596 ± 0.012   0.919     0.590      0.173

Table 4: Comparison of various TCP implementations.



simulations which experienced timeouts were excluded,
so the fit was applied to runs exhibiting only the
Congestion Avoidance algorithm 15. The number of such
runs is shown in column N.

For k = -0.5, the values of C are quite close to
the derived values. The quality of the fit is also quite
good. As expected, Delayed Acknowledgements change
C by $\sqrt{2}$. When k is allowed to vary slightly, the fit
becomes even better still, and the best values for k are
only slightly off from -0.5. This slight correction to
k probably reflects some of the simplifying assumptions
used in the derivation of Equation 3. One simplification
is that TCP implementations perform rounding down
to integral values in several calculations which update
cwnd. The derivation of the model assumes cwnd varies
smoothly, which overestimates the total amount of data
transferred in a congestion avoidance cycle. Another
simplification is that the model expects the window to
begin increasing again immediately after it is cut in half.
However, recovery takes a full RTT, during which TCP
may not open the window. We plan to investigate the
effects of these simplifications in the future.
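
One way to carry out such a fit is ordinary least squares on the logarithms of window and loss rate, sketched below in Python. Fitting in log-log space is an assumption made here (the text does not say how the fit was computed), the function names are illustrative, and the data points are synthetic values generated from a known power law rather than the paper's measurements.

```python
import math

def fit_power_law(p_values, windows):
    """Least squares fit of log(W) = log(C) + k * log(p); returns (C, k)."""
    xs = [math.log(p) for p in p_values]
    ys = [math.log(w) for w in windows]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    c = math.exp(mean_y - k * mean_x)
    return c, k

def fit_fixed_slope(p_values, windows, k=-0.5):
    """Fit only C with the slope pinned, as in the k = -0.5 case."""
    logs = [math.log(w) - k * math.log(p) for p, w in zip(p_values, windows)]
    return math.exp(sum(logs) / len(logs))

# Synthetic points from W = 0.9 * p**-0.52 (made-up values, not the paper's data).
ps = [0.0001, 0.0003, 0.001, 0.003, 0.01]
ws = [0.9 * p ** -0.52 for p in ps]

c, k = fit_power_law(ps, ws)
c_fixed = fit_fixed_slope(ps, ws)
print(f"free slope: C = {c:.3f}, k = {k:.3f}; fixed slope: C = {c_fixed:.3f}")
```

Here W stands for the rescaled bandwidth BW * RTT/MSS. Pinning the slope at -0.5 forces the constant to absorb the slope error, which is why the two fits report different values of C for the same data.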

4 TReno results 

Much of our experimentation in TCP congestion dy- 
namics has been done using the TReno performance 
diagnostic [Mat96]. It was developed as part of 
our research into Internet performance measurement 
under the IETF IP Performance Metrics working 
group [Mat97]. TReno is a natural succession to the
windowed ping diagnostic [Mat94b]. (The FACK-RH
algorithm for TCP is the result of the evolution of the 
congestion control implemented in TReno.) 

15 FACK fits less well because it avoids timeouts and thus
includes data at larger p, where rounding terms become significant.

Figure 9: TReno measurement data
This data fits Equation 3 with C = 0.744 ± 0.103 or Equation 5
with k = -0.617, C = 0.376 ± 0.064. These are poorer than
the fits in Table 4, in part because the TReno data extends to
much worse loss rates, where the effects of rounding become more
pronounced. The C values are lower, indicating about 20% lower
performance than TCP in a simulator.

TReno is designed to measure the single stream bulk
transfer capacity over an Internet path by implementing
TCP Congestion Avoidance in a user mode diagnostic
tool. It is an amalgam of two existing algorithms:
traceroute [Jac88b] and an idealized version of TCP
congestion control. TReno probes the network with either
ICMP ECHO packets (as in the ping program),
or low-TTL UDP packets, which solicit ICMP errors
(as in the traceroute program). The probe packets are
subject to queuing, delay and congestion-related loss
comparable to TCP data and acknowledgment packets.
The packets carry sequence numbers which are reflected
in the replies, such that TReno can always determine
which probe packet caused each response, and can use
this information to emulate TCP.

This has an advantage over true TCP for
experimenting with congestion control algorithms because
TReno only implements those algorithms and does not
need to implement the rest of the TCP protocol, such
as the three-way SYN handshake or reliable data
delivery. Furthermore, TReno is far better instrumented
than any of today's TCP implementations. Thus it is a
good vehicle to test congestion control algorithms over
real Internet paths, which are often not well-represented
by the idealized queuing models used in simulations.

However, TReno has some intrinsic differences from
real TCP. For one thing, TReno does not keep any state
(corresponding to the TCP receiver's state) at the far
end of the path. Both the sender's and receiver's
behaviors are emulated at the near end of the path. Thus
it has no way to distinguish between properties (such
as losses or delay) of the forward and reverse paths 16.

For our investigation, TReno was run at random
times over the course of a week to two hosts utilizing
different Internet providers. Due to normal fluctuation
in Internet load we observed nearly two orders of
magnitude fluctuations in loss rates. Each test lasted 60
seconds and measured model parameters p and RTT
from MIB-like instruments.

The TReno data 17 is very similar to the simulator 
data in Figure 4, except that the timeouts have a more 
profound negative impact on performance. If the runs 
containing timeouts are neglected, the data is quite sim- 
ilar. 

Also note that TReno suffered far more timeouts 
than FACK-RH in the simulator, even though they have 
nearly identical internal algorithms. This is discussed 
in the next section, where we make similar observations 
about the TCP data. 

5 TCP measurements 

In this section, we measured actual TCP transfers to 
two Internet sites in order to test the model. This was 
done by using a slightly modified version of "tcptrace"
[Ost96] to post-process packet traces to reconstruct p
and RTT from TCP's perspective. These instruments
are nearly identical to the instruments used in the TCP
simulations. 

16 If the ICMP replies originate from a router (such as
intermediate traceroute hops), TReno may suffer from a low performance
ICMP implementation on the router. This is not an issue in the
data presented here, because the end systems can send ICMP
replies at full rate.

17 TReno emulates Delayed Acknowledgements. 
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Figure 10: Measured TCP data, Set 1 

This data fits Equation 3 with C = 0.700 ± 0.057 or
Equation 5 with k = -0.525, C = 0.674 ± 0.045. These values
for C are about 25% lower than TCP in a simulator.
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Figure 11: Measured TCP data, Set 2 

This data fits Equation 3 with C = O.fcOB ± 0.134 or
Equation 5 with k = -0.811, C = 0.418 ± 0.086. Some of the
individual data points are above the C = 1 reference line.
The live Internet measurements were not over long enough intervals
to permit trimming the Slow-start overshoot described in
Appendix B. This data set may have been subject to Slow-start
overshoot.
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This experiment proved to be far more difficult than 
expected. We encountered a number of difficulties with 
the test paths themselves. The Internet is vastly noisier 
and less uniform than any simulation [Pax97a, Pax97b].
Furthermore, several paths exhibited behaviors that are
beyond the scope of the model 18.

The tests were run at random times over the course 
of 10 days, at an average rate of once per hour to each 
remote site. The period included a holiday weekend 
(with unusually low background traffic), and was not 
the same week as the TReno data. 

During our testing, the connections transferred as 
much data as possible in 100 seconds of elapsed time 
to two different Internet sites. Figure 10 shows that 
the model fits fairly well to data collected to one tar- 
get. If you compare this to Figure 4 it is in reasonable 
agreement, considering the difference in scale. 

Figure 11, on the other hand, does not fit as well. 
It is illustrative to dissect the data to understand what
is happening over this path, and how it relates to the
model's applicability. Our first observation is that there
are too many timeouts (indicated by open circles), con- 
sidering the low overall loss rate. 

To diagnose this phenomenon, we looked at the raw
packet traces from several of the transfers. Nearly all of 
the timeouts were the result of losing an entire window 
of consecutive packets. These short "outages" were not 
preceded by any unusual fluctuation in delay. Further- 
more, the following (Tahoe-style) recovery exhibited 
no SACK blocks or step advances in the acknowledg- 
ment number. Therefore an entire window of data had 
been lost on the forward path. This phenomenon has 
been observed over many paths in the Internet [Pax97b, 
p305] and is present in Figures 9 and 10, as well. Lore 
in the provider community attributes this phenomenon 
to an interaction between routing cache updates and 
the packet forwarding microcode in some commercial 
routers 19.

Our second observation (regarding traces without
timeouts) is that the number of "CA events" is very
small, with many traces showing 3 or fewer successful
window halving episodes. An investigation of the trace
statistics reveals that the path had a huge maximum
round trip time (1800 ms), and that during some of our
test transfers the average round trip time was as large
as a full second. This suggests that the path is
overbuffered and there is no active queue management in
effect to regulate the queue length at the bottleneck.

18 One discarded path suffered from packet reordering which
was severe enough that the majority of the retransmissions were
spurious.

19 Note that burst losses and massive reordering are not
detectable using tools with low sampling rates. These sorts of
problems can most easily be diagnosed in the production Internet with
tools that operate at normal TCP transfer rates.

Real TCP over this path exhibits the same symptoms
as the simulation of an overbuffered link without
RED in Section 3.3: long queuing delays and very long
cycle times for the congestion avoidance algorithm. As a
consequence, our 100 second measurement interval was
not really long enough and captured only a few congestion
signals, resulting in a large uncertainty in p. The
open circles on the left side of Figure 11 have observable
vertical banding in the data corresponding to 1, 2
or 3 total congestion avoidance cycles 20. Traces with 4
or more congestion avoidance cycles are included in the
good data (solid squares).

Our test script also used conventional diagnostic
tools to measure background path properties bracketing
the TCP tests. Although we measured several parameters,
the RTT statistics were particularly interesting.
For "not under test" conditions, the minimum RTT was
72.9 ms, and the average RTT was 82 ms (footnote 21).
From the tcptrace statistics, we know that during the test
transfers the average RTT rose to 461 ms (footnote 22).

Our TCP transfers were sufficient to substantially 
alter the delay statistics of the path. We believe this to 
be an intrinsic property of Congestion Avoidance: any 
long-running TCP connection which remains in Con- 
gestion Avoidance raises the link delay and/or loss rate 
to find its share of the bottleneck bandwidth. Then 
Equation 3 will agree with the actual bandwidth for the 
connection, and if the losses at the bottleneck are
sufficiently randomized, the link statistics (delay and loss)
will be common to all traffic sharing the bottleneck. 
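
The trade-off described above can be made concrete with a small numerical sketch. It assumes Equation 3 has the form BW = (MSS/RTT) * C / sqrt(p); the function name `model_bandwidth_bps`, the 1460 byte MSS, and the 1% loss rate are illustrative choices, not values taken from the measurements. Holding p fixed, the predicted bandwidth falls in proportion to the rise in RTT:

```python
import math


def model_bandwidth_bps(mss_bytes, rtt_s, p, c=1.0):
    """Predicted bandwidth in bits/s, assuming BW = (MSS / RTT) * C / sqrt(p)."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return (mss_bytes * 8 / rtt_s) * c / math.sqrt(p)


if __name__ == "__main__":
    # RTT values are the "not under test" and "under test" averages quoted above.
    # The MSS and loss rate are hypothetical, chosen only for illustration.
    for rtt in (0.082, 0.461):
        bw = model_bandwidth_bps(mss_bytes=1460, rtt_s=rtt, p=0.01)
        print(f"RTT {rtt * 1000:.0f} ms -> {bw / 1e6:.2f} Mb/s")
```

Under this form of the model, a connection can only recover the lost bandwidth at the longer RTT if the loss rate it sees falls correspondingly.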

In general, the current Internet does not seem to
exhibit this property. We suspect that this is due to a 
combination of effects, including Reno's inability to sus- 
tain TCP's Self-clock in the presence of closely-spaced 
losses, the prevalence of drop-tail queues and faulty 
router implementations. 

6 Multiple Congested Gateways 

In this section we apply the model to the problem of
TCP fairness in networks with multiple congested gateways.
Floyd published a simulation and heuristic analysis
of this problem in 1991 [Flo91]. In that paper,
she analyzed the following problem: given a network
of 2n gateways, where n are congested by connections
that use only one of the congested gateways, what portion
of the available bandwidth will a connection passing
through all n congested gateways receive? The
analysis of this problem presented by Floyd computes
bandwidth by determining the packet loss rate for each
connection (footnote 23). Here, we demonstrate that the
same results can be obtained by using Equation 3.

20 The bands are at roughly p ≈ 0.00015, 0.00035 and 0.0005.

21 Each background measurement consisted of 200 RTT samples
taken either shortly before or shortly after each test TCP transfer.
The median of the measurement averages was 60 ms for the "not
under test" case.

22 The median of the measurement averages was 468 ms for the
"under test" cases.

Unfortunately, the burst losses obscured our background loss
rate measurement, because in the average they caused rather more
packet loss than the true congestion signals. Since they caused
TCP timeouts, they were implicitly excluded from the TCP data,
but not from our background measurements.

Note that to some extent the burst losses and the RED-less
overbuffering are complementary bugs because each at least
partially mitigates the effects of the other.

23 We should note that Floyd's work was published three years
prior to Mathis [Mat94a] and five years prior to Ott [OKM96a].

Figure 12: The multiple congested gateways scenario.

Figure 12 (from [Flo91]) shows the exact scenario we
are interested in analyzing. Each dotted line indicates
a connection from a source to a sink. Generalizing the
parameters, we define $\delta$ to be the individual delay of the
long links (50 ms in Figure 12) and $\epsilon$ to be the delay
of the short links (5 ms in Figure 12). The round-trip
delay for Connection i is:

$$\delta_i = 2\delta + \delta_Q + 4\epsilon \tag{6}$$

Here, we add a term $\delta_Q$ which represents the average
delay due to queuing at the bottleneck router (footnote 24).
The round-trip delay for Connection 0 is:

$$\delta_0 = 2(2n - 1)\delta + n\delta_Q + 4\epsilon \tag{7}$$

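The model can then be used to predict the bandwidth
for each connection:

$$B_i = C \, \frac{MSS}{\delta_i} \, \frac{1}{\sqrt{p}} \tag{8}$$

$$B_0 = C \, \frac{MSS}{\delta_0} \, \frac{1}{\sqrt{np}} \tag{9}$$

The total link bandwidth used at each congested hop
is given by the sum of these two values:

$$B = B_i + B_0 = C \, \frac{MSS}{\sqrt{p}} \left( \frac{1}{\delta_i} + \frac{1}{\delta_0 \sqrt{n}} \right) \tag{10}$$

The fraction of the bandwidth used by Connection
0 is then given by (divide Equation 9 by Equation 10):

$$\frac{B_0}{B} = \frac{1}{1 + \sqrt{n} \, \dfrac{\delta_0}{\delta_i}} = \frac{1}{1 + \sqrt{n} \, \dfrac{2(2n - 1)\delta + n\delta_Q + 4\epsilon}{2\delta + \delta_Q + 4\epsilon}} \tag{11}$$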

The first part of Equation 11 exactly matches
Floyd's result [Flo91, Claim 5, Equation 3] (footnote 25).
The second part of Equation 11 fills in the precise formulae
for the delays. If we assume $\delta_Q$ and $\epsilon$ are small, we
again match Floyd's results [Flo91, Corollary 6]:

$$\frac{B_0}{B} \approx \frac{1}{1 + \sqrt{n}(2n - 1)} \tag{12}$$

In the case of overbuffered, drop-tail gateways,
where $\delta_Q$ is large and phase effects are not an issue,
we get a slightly different result:

$$\frac{B_0}{B} \approx \frac{1}{1 + n^{3/2}} \tag{13}$$

24 We make the assumption that $\delta_Q$ is the same for all of the
connections being considered. This is probably not a realistic
assumption, but it does allow us to simplify the problem somewhat.

25 Noting that we are using an increase-by-1 window increase
algorithm, which is identical for both the long and short
connections.

It is useful to note that C has dropped out of the
calculation. A precise estimate of C was not needed.

In this section, we have used Equation 3 to estimate 
TCP performance and bandwidth allocation in a com- 
plex network topology. Our calculation agrees with the 
prior heuristic estimates for this environment. 
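
The closed forms in Equations 11 through 13 are easy to check numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and the delay values passed in the demonstration loop (50 ms long links and 5 ms short links, as in Figure 12, with the queuing delay set to zero) are chosen only for illustration.

```python
import math


def rtt_short(delta, delta_q, eps):
    """Round-trip delay of a one-hop Connection i (Equation 6)."""
    return 2 * delta + delta_q + 4 * eps


def rtt_long(n, delta, delta_q, eps):
    """Round-trip delay of Connection 0 through n congested gateways (Equation 7)."""
    return 2 * (2 * n - 1) * delta + n * delta_q + 4 * eps


def share_of_connection_0(n, delta, delta_q, eps):
    """Fraction B_0 / B of a congested hop used by Connection 0 (Equation 11)."""
    ratio = rtt_long(n, delta, delta_q, eps) / rtt_short(delta, delta_q, eps)
    return 1.0 / (1.0 + math.sqrt(n) * ratio)


if __name__ == "__main__":
    for n in (1, 2, 5, 10):
        exact = share_of_connection_0(n, delta=0.050, delta_q=0.0, eps=0.005)
        small_queue = 1.0 / (1.0 + math.sqrt(n) * (2 * n - 1))  # Equation 12
        large_queue = 1.0 / (1.0 + n ** 1.5)                    # Equation 13
        print(f"n={n:2d}  Eq.11={exact:.4f}  "
              f"Eq.12={small_queue:.4f}  Eq.13={large_queue:.4f}")
```

Neither C, MSS, nor p appears as an argument, which mirrors the observation that these quantities cancel when the ratio is formed.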

7 Implications for the Internet

Figure 13: Simplified Internet Topology

In this section we use Equation 3 to investigate some
properties of traffic and congestion in the global Internet.
Figure 13 is a very simplified schematic of the
Internet. To the left are information providers, including
a large population of WWW servers and other large
data centers, such as supercomputing resources.

To the right are information consumers, including a 
large population of modem-based users and a smaller 
population of Research and Education users, who are 
trying to retrieve data from the data center. 

The link between the data suppliers and consumers 
represents a long path through an intra-continental In- 
ternet. For illustration purposes we assume that the 
path can be modeled as having a single bottleneck, such 
that the entire path has a fixed average delay and a vari- 
able loss rate due to the total load at the bottleneck. 
Furthermore we assume that each individual data sup- 
plier or consumer shown in Figure 13 presents an in- 
significant fraction of the total load on the link. The 
loss rate on L is determined by the aggregate load (of
many modems and R&E users) on the link and is uni- 
form across all users. Thus the individual connections 
have no detectable effect upon the loss rate. 

We wish to investigate how changes in the loss rate at
link L might affect the various information consumers
and to explore some strategies that might be used to
control the elapsed time needed to move scientific data
sets.

Figure 14: Packet loss experienced by modem users.
Packet loss on link L is largely offset by reduced loss at the
modem. "Total loss" is computed from the link instruments,
"CA-Events" from the instruments in the TCP. (This is for
FACK-RH TCP).

Figure 15: Effect of loss rate at link L.
While modem users (with a 1000 byte MSS) do not notice loss
on link L until it is quite high, R&E users suffer severe
performance degradation, particularly when using a small MSS.

First we consider TCP congestion control at the
modems, which are the likely bottlenecks for the low
end users. A full-sized packet (1000 bytes) takes about
277 ms of modem transmission time. The RTT of the
unloaded path is about 400 ms (1000 byte packets flowing
in one direction, 40 byte acknowledgements flowing
in the other). Assume 5 packets of queue space at the
modem, then the maximum window size is between 6
and 7 packets, so the "half window" must be roughly
3 packets, yielding an average window of about 5 packets.
Since the data packets arrive at the clients with
more than 200 ms headway, acknowledgements will be
sent for every segment (due to the Delayed Acknowledgement
timer). Thus we assume that C = 1.2 (the
non-Delayed Acknowledgement periodic case from Table 1),
so the 5 packet average window requires a loss
rate of roughly 4% (p = 0.04). This can be observed
on the far left edge of Figure 14. (We are assuming
FACK-RH TCP).
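
The window arithmetic in the paragraph above can be reproduced with a short sketch. It assumes a 28.8 kb/s modem (the rate named later in this section) and treats the average window as the midpoint between the post-halving window and the maximum window; both the function names and that midpoint approximation are assumptions made for illustration.

```python
def serialization_time(packet_bytes, link_bps):
    """Seconds needed to clock one packet onto the link."""
    return packet_bytes * 8 / link_bps


def modem_windows(packet_bytes=1000, link_bps=28_800,
                  unloaded_rtt=0.400, queue_packets=5):
    """Return (max_window, half_window, average_window) in packets."""
    per_packet = serialization_time(packet_bytes, link_bps)
    in_flight = unloaded_rtt / per_packet      # packets the unloaded path holds
    max_window = in_flight + queue_packets     # path plus a full modem queue
    half_window = max_window / 2               # window just after a halving
    average_window = (max_window + half_window) / 2  # midpoint of the sawtooth
    return max_window, half_window, average_window


if __name__ == "__main__":
    print(f"per-packet time: {serialization_time(1000, 28_800) * 1000:.0f} ms")
    print("max, half, average window:", modem_windows())
```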

The packet loss at the modem provides feedback to
the TCP sender to regulate the queue at the modem.
This queue assures that the modem utilization is high,
such that data is delivered to the receiver every 277 ms.
These data packets cause the receiver to generate ac- 
knowledgements every 277 ms, which clock more data 
out of the server at a nearly constant average rate. 

Next we want to consider what happens if link L 
loses about 3% of the packets. Since this is not enough 
to throttle TCP down to 28.8 kb/s, the modem must
still be introducing some loss, but less than before. 
Since the modem is still introducing loss, it must still 
have a significant average queue, so the server still sends 
data every 277 ms. With any SACK TCP (including 
FACK-RH), only the missing data is retransmitted, so 
the average goodput for the modem user continues to 
be 28.8 kb/s. Therefore the 3% loss on link L has an
insignificant effect on the modem user. 

As the loss rate rises beyond 3%, the queue at the 
modem becomes shorter, reducing the RTT from about 
1.2 seconds down toward 400 ms. Note that a 5 packet
queue overbuffers the path, so the utilization does not
start to fall until the loss rate approaches 10% (footnote 26).

Now consider the plight of an R&E user (see Figure 15).
What performance limitations are imposed on
the R&E user by 3% loss? From Equation 3, the average
window size must be about 5 packets. If these are
536 byte packets (with a 100 ms RTT), the user can get
no more than about 250 kb/s. Timeouts and other difficulties
could further reduce this performance. At 250
kb/s, moving 1 Gigabyte of data (footnote 27) takes over
8 hours. As a consequence, some researchers have been known
to express mail tapes instead of using the Internet to
transfer data sets.

26 Note that this is FACK-RH TCP, which does substantially
better than other TCPs in this region.

Many recovery episodes exhibit multiple dropped packets (note
that the total link loss rate and CA-Events differ) so Reno has no
hope of preserving its Self-clock. As the peak window size falls
below 5 packets, conventional Fast Retransmit will also fail.

27 Note that at today's prices, with disk space available at
below $100 per Gigabyte, workstations commonly have several
Gigabytes of disk space. It is not at all unusual for researchers to

Suppose the R&E user needs to move 1 Gigabyte of
data in 2 hours. This requires a sustained transfer rate
of about 1 Mb/s. What loss rate does the user need
to meet this requirement? Assume C < 1 (because the
R&E receivers will be using Delayed Acknowledgments)
and invert Equation 4 to get a bound on p:

$$p < \left( \frac{MSS}{BW \cdot RTT} \right)^2 \tag{14}$$

The model predicts that the R&E user needs a loss
rate better than 0.18% (p = 0.0018) with 536 byte packets.
At 1460 bytes, the maximum loss rate rises to 1.4%.
If the R&E user upgrades to FDDI (and uses 4312 byte
packets), Equation 14 suggests that the network only
needs to have less than 11% loss.

In practice, we need to consider the actual value of
C and potential bottlenecks in all other parts of the 
system, as well as the details of the particular TCP im- 
plementation. This calculation using the model agrees 
with the simulation shown in Figure 15. 
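
The figures quoted for the R&E user can be reproduced with a few lines of arithmetic. This is a sketch under stated assumptions: Equation 14 is taken to be p < (MSS / (BW * RTT))^2, the RTT is the 100 ms used earlier in this section, the target rate is rounded to 1 Mb/s, 1 Gigabyte is taken as 10^9 bytes, and all function names are hypothetical.

```python
def window_limited_rate_bps(window_packets, mss_bytes, rtt_s):
    """Rate sustained by a fixed average window, in bits/s."""
    return window_packets * mss_bytes * 8 / rtt_s


def transfer_hours(total_bytes, rate_bps):
    """Hours needed to move total_bytes at rate_bps."""
    return total_bytes * 8 / rate_bps / 3600


def max_loss_rate(mss_bytes, target_bps, rtt_s):
    """Bound on p, assuming p < (MSS / (BW * RTT))^2 with C < 1."""
    target_bytes_per_s = target_bps / 8
    return (mss_bytes / (target_bytes_per_s * rtt_s)) ** 2


if __name__ == "__main__":
    # A 5 packet window of 536 byte segments at 100 ms RTT.
    print("rate (b/s):", window_limited_rate_bps(5, 536, 0.100))
    # Time to move 1 Gigabyte at the 250 kb/s ceiling quoted in the text.
    print("hours:", transfer_hours(1e9, 250_000))
    # Loss-rate bound for each MSS discussed in the text.
    for mss in (536, 1460, 4312):
        print(mss, "bytes ->", max_loss_rate(mss, 1_000_000, 0.100))
```

Because the bound scales with the square of the MSS, roughly tripling the segment size relaxes the tolerable loss rate by close to an order of magnitude.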

Note that the specific results in this section are very 
sensitive to many of our assumptions, especially to the 
use of FACK-RH TCP and the 5 packet queue at the 
modem. Different assumptions will change the rela- 
tive positions of the data in our graphs, but the overall 
trends are due to intrinsic properties of TCP congestion 
control and the Congestion Avoidance algorithm. 

We can draw some useful rules-of-thumb from our
observations. First, each factor of 3 in the MSS (4312 
to 1460, or 1460 to 536) lowers the required end-to-end 
loss rate by nearly an order of magnitude. Furthermore, 
a network which is viewed as excellent by modem users 
can be totally inadequate for a Research and Education 
user. 

8 Conclusion 

We have shown, through simulation and live observations,
that the model in Equation 3 can be used to
predict the bandwidth of Congestion Avoidance-based
TCP implementations under many conditions.

In the simulator all presented TCPs fit the model
when losses are infrequent or isolated. However, since
different TCPs vary in their susceptibility to timeouts, 
they diverge from the model at different points. 

Live Internet tests show rough agreement with the 
model in cases where no pathological behaviors are 
present in the path. 

The model is most accurate when using delay and 
loss instruments in the TCP itself, or when loss is 
randomized at the bottleneck. With non-randomized 
losses, such as drop-tail queues, the model may not be 
able to predict end-to-end performance from aggregate 
link statistics. 

(footnote 27, continued) want to transfer a few Gigabytes of data at one time.



FACK-RH, which treats multiple packet losses as 
single congestion signals, fits the model across a very 
wide range of conditions. Its behavior is very close to 
ideal TCP Congestion Avoidance. Reno, on the other 
hand, stumbles very easily and deviates from the model 
under fairly ordinary conditions. 

To produce a model that applies to all loss rates, we 
need to have a model for timeout-driven behavior. 

Overbuffering without RED or some other form of
queue management does not interact well with SACK
TCP. A single pair of end-systems running SACK over
a long Internet path without RED is likely to sustain
persistent, unpleasantly long queues.

The model can be used to predict how TCP shares 
Internet bandwidth. It can also be used to predict the 
effects of TCP upon the Internet, and represents an 
equilibrium process between loss, delay and bandwidth.

A FACK-RH TCP

The FACK-RH TCP used in the simulations and in
the TReno experiment is slightly different than the
FACK version presented at Sigcomm96 [MM96a]. We
replaced "Overdamping" and "Rampdown" by a combined
"Rate-Halving" algorithm, which preserves the
best properties of each. Rate-Halving quickly finds
the correct window size following packet loss, even under
adverse conditions, while maintaining TCP's Self-clock.
In addition, we strengthen the retransmission
strategy by decoupling it completely from congestion
control considerations during recovery. An algorithm
we call "Thresholded Retransmission" moves the
tcprexmtthresh logic to the SACK scoreboard and applies
it to every loss, not just the first. We also add
"Lost Retransmission Detection" to determine when
retransmitted segments have been lost in the network.

Rate-Halving congestion control adjusts the window
by sending one segment per two ACKs for exactly one
round trip during recovery. This sets the new window
to exactly one-half of the data which was actually held
in the network during the lossy round trip. At the beginning
of the lossy round trip snd.cwnd segments have
been injected into the network. Given that there have
been some losses, we expect to receive (snd.cwnd - loss)
acknowledgments. Under Rate-Halving we send half as
many segments, so the net effect on the congestion window
is:

$$\mathit{snd.cwnd}_{new} = \frac{\mathit{snd.cwnd}_{old} - \mathit{loss}}{2} \tag{15}$$

This algorithm can remain in Congestion Avoidance, 
without timing out, at higher loss rates than algorithms 
that wait for half of the packets to drain from the net- 
work when the window is halved. 
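
The counting argument behind Equation 15 can be illustrated with a toy loop that releases one segment for every second acknowledgment. This is a sketch only: the function name is hypothetical, retransmissions and the Bounding-Parameters are not modeled, and the integer rounding for odd ACK counts is an artifact of the sketch rather than something the text specifies.

```python
def rate_halving_round_trip(cwnd, lost):
    """Count segments sent during one recovery round trip under Rate-Halving.

    cwnd segments were in flight and `lost` of them were dropped, so
    (cwnd - lost) acknowledgments arrive during the round trip.
    """
    if not 0 <= lost <= cwnd:
        raise ValueError("lost must be between 0 and cwnd")
    acks_expected = cwnd - lost
    sent = 0
    for ack_index in range(1, acks_expected + 1):
        if ack_index % 2 == 0:  # every second ACK releases one segment
            sent += 1
    return sent                 # segments now in flight: the new window


if __name__ == "__main__":
    for cwnd, lost in ((20, 4), (10, 0), (7, 2)):
        print(cwnd, lost, "->", rate_halving_round_trip(cwnd, lost))
```

Because transmission is clocked by arriving ACKs throughout, the sender never has to wait idle for half of the window to drain.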

We detect when exactly one round trip has elapsed 
by comparing the value of the forward-most SACK 
block in each ACK to the value of snd.nxt saved at 
the time the first SACK block arrived. 

Bounding-Parameters add additional controls to
guarantee that the final window is appropriate, in spite
of potential pathological network or receiver behaviors.
For example, a TCP receiver which sends superfluous
ACKs could cause Rate-Halving to settle upon an
inappropriately large window. Bounding-Parameters assure
that this and other pathologies still result in reasonable
windows. Since the Bounding-Parameters have no effect
under normal operation, they have no effect on the
results in this paper.

We are continuing to tinker with some of the de- 
tails of these algorithms, but mostly in areas that have 
only minute effects on normal bulk TCP operations. 
The current state of our TCP work is documented at
http://www.psc.edu/networking/tcp.html.

B Reaching Equilibrium 

In several of our simulations and measurements we 
noted that an excessive amount of time was sometimes 
required for TCP to reach equilibrium (steady-state). 

One interpretation of Equation 3 is that the average
window size in packets will tend to $C/\sqrt{p}$. However,
during a Slow-start (without Delayed ACKs), the expected
window size is on the order of $1/p$ when the
first packet is dropped, $2/p$ when the loss is detected,
and back down to $1/p$ by the end of recovery (assuming
SACK TCP). This window is much too large if p
is small. It then takes roughly $\log_2(1/\sqrt{p})$ congestion
signals to bring the window down to the proper size.
This requires the delivery of $\frac{1}{p}\log_2(1/\sqrt{p})$ packets, which
is large if p is small.

The effect of this overshoot can be significant. Suppose
p = 1/256 (approximately 0.004) then we have
$1/\sqrt{p} = 16$ and $\log_2(1/\sqrt{p}) = 4$. So it takes roughly 1000
packets to come into equilibrium. At 1500 bytes/packet,
this is more than 1.5 Mbytes of data.

The average window in steady state will be 16 packets
(24 kbytes). If the path has a 100 ms RTT, the
steady state average bandwidth will be close to 2 Mb/s.
However the peak window and bandwidth might be
larger by a factor of 16: 256 packets (about 0.4 Mbytes)
and 30 Mb/s. (This is a factor of $1/\sqrt{p}$). The overshoot
will be this pronounced only if it consumes a negligible
fraction of the bottleneck link. Clearly this will not be



the case over most Internet paths so the Slow-start will 
drive up the loss rate (or run out of receiver window) 
causing TCP to converge more quickly. It is unclear 
how significant this overshoot is in the operational In- 
ternet. 
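
The overshoot figures above follow from a few lines of arithmetic, sketched here under the simplifying assumption that C is close to 1. The function name and the dictionary keys are hypothetical; the 1500 byte packet size and 100 ms RTT are the values used in the example.

```python
import math


def slow_start_overshoot(p, packet_bytes=1500, rtt_s=0.100):
    """Estimate the cost of the Slow-start overshoot for loss rate p (C ~ 1)."""
    steady_window = 1 / math.sqrt(p)    # packets, the steady-state average window
    peak_window = 1 / p                 # packets remaining after initial recovery
    halvings = math.log2(peak_window / steady_window)
    packets_to_converge = halvings / p  # about 1/p packets per congestion signal
    return {
        "steady_window": steady_window,
        "peak_window": peak_window,
        "halvings": halvings,
        "packets_to_converge": packets_to_converge,
        "bytes_to_converge": packets_to_converge * packet_bytes,
        "steady_bps": steady_window * packet_bytes * 8 / rtt_s,
        "peak_bps": peak_window * packet_bytes * 8 / rtt_s,
    }


if __name__ == "__main__":
    for key, value in slow_start_overshoot(1 / 256).items():
        print(f"{key}: {value:,.0f}")
```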

In all of our simulations we estimate the time required
for the connection to reach steady-state, and exclude
the initial overshoot when measuring loss, delay
and bandwidth.

Abstract

In a previous paper [F91] we explored the bias in TCP/IP
networks against connections with multiple congested 
gateways, for networks with one-way traffic. Using 
simulations and a heuristic analysis, we showed that in 
a network with the window modification algorithm in 
4.3 tahoe BSD TCP and with Random Drop or Drop
Tail gateways, a longer connection with multiple con- 
gested gateways can receive unacceptably low through- 
put. However, we showed that in a network with no 
bias against connections with longer roundtrip times 
and with no bias against bursty traffic, a connection with 
multiple congested gateways can receive an acceptable 
level of throughput. 

In this paper we show that the addition of two-way
traffic to our simulations results in compressed ACKs at
the gateways, resulting in much more bursty traffic in the
network. We use Priority gateways to isolate the effects
of compressed ACKs in networks with two-way traffic;
these Priority gateways give priority to small packets
such as ACK packets, thereby allowing two-way traffic
without compressed ACKs. In our simulations with 
two-way traffic and multiple congested gateways with 
FIFO service and Drop Tail or Random Drop packet- 
drop algorithms, the throughput for the connection with 
multiple congested gateways suffers significantly. This 
loss of throughput can be reduced either by the use of 
Priority gateways, which reduces the burstiness of the 
traffic, or by the use of Random Early Detection (RED) 
gateways, which are designed to avoid a bias against
bursty traffic. 



*This work was supported by the Director, Office of Energy
Research, Scientific Computing Staff, of the U.S. Department of Energy
under Contract No. DE-AC03-76SF00098.



1 Introduction 

In this paper we investigate the throughput of connections
in TCP/IP networks with two-way traffic and multiple
congested gateways.



2 Simulator algorithms 

Our simulator is briefly described in [F91], along with
the algorithms used in these simulations.

Figure 1: Simulation network with 5 congested gateways.

Figure 1 shows a simulation network for one-way
traffic, with 6 connections, 10 gateways, and 5 congested
gateways. The congested gateways in Figure 1
are gateways 1a, 2a, 3a, 4a, and 5a. The dotted lines
show the connection paths; source i sends to sink i.
Each connection has a maximum window just large
enough so that, even when that connection is the only
active connection, the network still occasionally drops
packets.

For one-way traffic we use a family of simulation
networks similar to Figure 1, where the number of congested
gateways n ranges from 1 to 10. Figure 1 only
shows the network for n = 5 congested gateways. For
a simulation network with n congested gateways for
n > 1 there are n + 1 connections and 2n gateways.
Connection 0 passes through multiple congested gateways,
and connections 1 through n each pass through
one congested gateway. Connection 0 is roughly 2n - 1
times longer than the other n connections. In these simulations
all connection paths have the same maximum
bandwidth.

Definitions: Reduce-to-One, Increase-by-One.
Some of the simulations in the paper use the Reduce-to-One
window decrease algorithm and the Increase-by-One
window increase algorithm from 4.3 tahoe BSD
TCP [J88]. Our simulator does not use the 4.3-tahoe
TCP code directly but we believe it is functionally identical.
Briefly, there are two phases to the window-adjustment
algorithm. In the slow-start phase the window
is doubled each roundtrip time until it reaches a
certain threshold. Reaching the threshold causes a transition
to the congestion-avoidance phase where the window
is increased by roughly one packet each roundtrip
time. In this paper we call the increment algorithm used
in the congestion-avoidance phase the Increase-by-One
algorithm. Packet loss (a dropped packet) is treated
as a "congestion experienced" signal. The source uses
timeouts or "fast retransmit" to discover the loss (if four
ACK packets acknowledging the same data packet are
received, the source decides a packet has been dropped)
and reacts by setting the transition threshold to half the
current window, decreasing the window to one packet
and entering the slow-start phase. In this paper we call
this the Reduce-to-One algorithm. □
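
The two-phase window adjustment just defined can be sketched as a toy state machine. This is not the simulator's code or the 4.3-tahoe code: the class and method names are hypothetical, the window is tracked in packets per roundtrip time, and clamping the doubled window at the threshold is an assumption about how "until it reaches a certain threshold" is applied.

```python
class TahoeWindow:
    """Toy model of the Reduce-to-One / Increase-by-One window rules."""

    def __init__(self, threshold=64.0):
        self.window = 1.0           # packets
        self.threshold = threshold  # slow-start to congestion-avoidance boundary

    def on_round_trip(self):
        """Apply one loss-free roundtrip time of window growth."""
        if self.window < self.threshold:
            # Slow-start: double, but do not jump past the threshold (assumption).
            self.window = min(self.window * 2, self.threshold)
        else:
            # Congestion avoidance: Increase-by-One.
            self.window += 1.0

    def on_loss(self):
        """Reduce-to-One: halve the threshold and restart from one packet."""
        self.threshold = max(self.window / 2, 1.0)
        self.window = 1.0


if __name__ == "__main__":
    w = TahoeWindow(threshold=8.0)
    for _ in range(4):
        w.on_round_trip()
        print("window:", w.window)
    w.on_loss()
    print("after loss:", w.window, "threshold:", w.threshold)
```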

In order to achieve the highest throughput for longer
connections, some of these simulations use the Fast Recovery
algorithm in 4.3 reno BSD TCP, along with Selective
Acknowledgements (or SACKs). In this paper
the Fast Recovery algorithm implemented in 4.3 reno
BSD TCP is called the Reduce-by-Half algorithm.

Definitions: Reduce-by-Half window decreases
and Selective Acknowledgements. With the Reduce-by-Half
window decrease algorithm, when a packet loss
is detected by the "fast retransmission" algorithm the
connection reduces its window by half. For the purposes
of this paper, the important feature of the Reduce-by-Half
window decrease algorithm is that with the "fast
retransmission" algorithm, the source retransmits a packet
and reduces the window by half, rather than reducing its
window to one packet. For the simulations in this paper,
we use Selective Acknowledgement sinks [JB88]
with the Reduce-by-Half algorithm. With Selective
Acknowledgement sinks, each ACK acknowledges not
only the last sequential packet received for that connection,
but also acknowledges all other (non-sequential)
packets received. □
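
What a Selective Acknowledgement sink reports can be illustrated with a small helper. This is a sketch of the idea only: numbering packets from 1, representing the ACK as a tuple, and the function name are all assumptions made for the example, not details of the simulator.

```python
def selective_ack(received):
    """Return (last_sequential, others) for a set of received packet numbers.

    last_sequential is the highest packet number such that every packet from
    1 up to it has arrived; others lists the non-sequential packets beyond it.
    """
    last_sequential = 0
    while last_sequential + 1 in received:
        last_sequential += 1
    others = sorted(n for n in received if n > last_sequential)
    return last_sequential, others


if __name__ == "__main__":
    # Packets 4, 7 and 8 are missing at the sink.
    print(selective_ack({1, 2, 3, 5, 6, 9}))
```

With this information the source can retransmit only the missing packets instead of everything after the first gap.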

Some of the simulations in this section use the
TCP Increase-by-One window-increase algorithm for
the congestion-avoidance phase of the window-increase
algorithm. As shown in [FJ91a], this algorithm has a
bias against connections with longer roundtrip times. In
order to eliminate this bias, some of our simulations use
the Constant-Rate algorithm instead in the congestion-avoidance
phase. We are not proposing the Constant-Rate
algorithm for current networks; we simply are using
the Constant-Rate algorithm to explore throughput
in networks with a window-increase algorithm with no
bias in favor of shorter-roundtrip-time connections.

Definitions: Constant-Rate window increases. In
the Constant-Rate window-increase algorithm, in the
congestion-avoidance phase each connection increases
its window by roughly $a \cdot r^2$ packets each roundtrip
time, for some fixed constant a, and for r the calculated
average roundtrip time. Using this algorithm, each connection
increases its window by a pkts/sec in each second.
For the simulations in this paper, we use a = 4.
□
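
The reason an increment proportional to the square of the roundtrip time removes the bias can be checked directly. In this sketch (function names hypothetical, r in seconds), the window gain per roundtrip is converted first to a gain in sending rate and then to a gain per second of elapsed time; the result is a regardless of r.

```python
def constant_rate_increment(a, rtt_s):
    """Packets added to the window each roundtrip time: a * r^2."""
    return a * rtt_s ** 2


def rate_growth_per_second(a, rtt_s):
    """Growth in sending rate (pkts/sec) per second of elapsed time."""
    window_gain_per_rtt = constant_rate_increment(a, rtt_s)
    rate_gain_per_rtt = window_gain_per_rtt / rtt_s  # pkts/sec gained each RTT
    return rate_gain_per_rtt / rtt_s                 # there are 1/r RTTs per second


if __name__ == "__main__":
    for rtt in (0.05, 0.1, 0.5):
        print(f"r = {rtt} s: +{constant_rate_increment(4, rtt):.2f} pkts per RTT, "
              f"+{rate_growth_per_second(4, rtt):.1f} pkts/sec per second")
```

By contrast, an increment of one packet per roundtrip time gives a rate growth of 1/r^2 pkts/sec per second, which favors connections with short roundtrip times.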

In this paper we examine networks with Drop Tail, 
Random Drop, and RED gateways. As shown in 
[FJ91a], simulations and measurement studies with
Drop Tail gateways are vulnerable to traffic phase effects;
small changes in network parameters can result
in large changes in the performance of the network. In 
order to avoid these phase effects in networks with Drop 
Tail gateways, in this paper the simulator adds a small 
random component to the roundtrip time for each packet 
in simulations using Drop Tail gateways. (This is dis- 
cussed in [FJ91a].) Normally, our simulator charges 
zero seconds for the time required to process packets at 
the nodes. In this paper each source node adds a random
time uniformly distributed in [0, b], for b ≈ 5.3
ms, the bottleneck service time of the network, to the
time required by the source node to process each ACK
packet in the simulations with Drop Tail gateways. This
is not intended to model any assumptions about realistic 
network behavior, but to eliminate problems with traffic 
phase effects. With this added random component the 
simulations with Drop Tail gateways give similar results 
to the simulations with Random Drop gateways. 

To avoid the bias against bursty traffic common to
Random Drop and to Drop Tail gateways, we also ex- 
amine performance in simulations with Random Early 
Detection (RED) gateways, a modified version of Ran- 
dom Drop gateways that detect incipient congestion. 
RED gateways maintain an upper bound on the average 
queue size at the gateway. 

Definitions: RED gateways. With our implementation
of RED gateways [FJ91c], the gateway computes
the average size for each queue using an exponential
weighted moving average. When the average queue
size exceeds a certain threshold, indicating incipient
congestion, the gateway randomly chooses a packet to
drop and increases the threshold. As more packets arrive
at the gateway, the threshold slowly decreases to
its previous value. The gateway chooses a packet to
drop by choosing a random number n in the interval 1
to range, where range is a variable parameter of the
gateway. The gateway drops the nth packet to arrive
at the gateway. RED gateways are described in more
detail in a paper currently in progress [FJ91c]. □
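
The drop decision described in this definition can be sketched as follows. This is an illustration under several assumptions, not the gateway implementation of [FJ91c]: the class and parameter names are hypothetical, the averaging weight is a free parameter, the arriving packet that triggers detection is counted as the first of the n arrivals, and the temporary raising and decay of the threshold after a drop is omitted because the text does not give its details.

```python
import random


class RedGatewaySketch:
    """Toy drop decision: averaged queue size plus a randomly chosen arrival."""

    def __init__(self, weight, threshold, drop_range, rng=None):
        self.weight = weight          # weight of the moving average (assumed name)
        self.threshold = threshold    # average queue size signalling congestion
        self.drop_range = drop_range  # the "range" parameter from the text
        self.avg = 0.0
        self.countdown = None         # arrivals left until the chosen packet
        self.rng = rng or random.Random()

    def on_arrival(self, queue_len):
        """Update the average; return True if the arriving packet is dropped."""
        self.avg = (1 - self.weight) * self.avg + self.weight * queue_len
        if self.countdown is None and self.avg > self.threshold:
            # Incipient congestion: pick which upcoming arrival to drop.
            self.countdown = self.rng.randint(1, self.drop_range)
        if self.countdown is not None:
            self.countdown -= 1
            if self.countdown == 0:
                self.countdown = None
                return True
        return False


if __name__ == "__main__":
    gateway = RedGatewaySketch(weight=0.2, threshold=10, drop_range=5,
                               rng=random.Random(1))
    drops = [gateway.on_arrival(queue_len=25) for _ in range(30)]
    print("dropped arrivals:", [i for i, d in enumerate(drops) if d])
```

Because the decision is driven by the averaged queue size, a short burst that briefly lengthens the instantaneous queue does not by itself trigger a drop.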

One advantage of RED gateways is that, unlike Drop
Tail and Random Drop gateways, RED gateways do
not have a bias against bursty traffic. The bias of Drop
Tail and of Random Drop gateways against bursty traffic
and the correction of this bias in RED gateways are
described in [FJ91a]. With Drop Tail or Random Drop
gateways, the more bursty the traffic, the more likely it
is that the queue will overflow and the Drop Tail or Random
Drop gateway will drop a packet. This is because
a burst of packets results in a temporary increase in the
queue size at the gateway. With RED gateways the
detection of congestion depends on the average queue size,
not on the maximum queue size. Thus with RED gate- 
ways bursty traffic is less likely to result in the detection 
of congestion. With a RED gateway even when bursty 
traffic results in the detection of congestion at the gate- 
way, the mechanism for dropping packets ensures that 
the bursty connection does not have a disproportionate 
probability of having a packet dropped. □ 

For the RED simulations in this paper the maximum
queue size is 60 packets, and the RED gateways drop
packets when the average queue size is between 10 and
20 packets. (The range from 10 to 20 packets for the
average queue size for RED gateways is somewhat arbitrary;
the optimum average queue size is still a question
for further research.) For the two-way traffic simulations
with Random Drop and Drop Tail gateways, the
maximum queue size is 20 packets (I have to rerun these
simulations for a maximum queue of 60 packets). The
maximum windows for connections 1 to n are set sufficiently
large to force occasional packet drops even in
the absence of traffic from connection 0.



Tail) gateways. We show that with Random Drop gate- 
ways, the addition of two-way traffic causes a substan- 
tial loss in throughput for the longer connection. With 
RED gateways, with their increased ability to accomo- 
date bursty traffic, the addition of two-way traffic causes 
only a small change in the behavior of the network. 
We use gateways with Priority service, which give pri- 
ority to small packets, to compare the throughput in 
networks with two-way traffic with and without com- 
pressed ACKs. 
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Figure 2: Simulation network with two-way traffic. 

Figure 2 shows the simulation network with two- 
way traffic and with n congested forward gateways, for 
n = 5. Connection 0 goes from source 0 to sink 0. Gate- 
ways la to 5a are congested in connection 0's forward 
direction, and gateways 16 to 56 are congested in the 
reverse direction. With this two-way traffic, connection 
0 suffers from significant ACIC-compresston. All of the 
connections in this network can experience compressed 
ACKs, but because connection 0 passes through mul- 
tiple congested gateways, the problem of compressed 
ACKs is more severe for connection 0. All of the sim- 
ulations in this section use Constant-Rate window in- 
creases and Reduce-by-Half decreases. 



3 Simulations with two-way traffic 

The simulations in Section 2 were of a network with 
one-way traffic. In this section, we explore simula- 
tions with two-way traffic. As discussed in [WRM91] 
and [ZC91], two-way traffic introduces the added com- 
plication of ACK-comprcssion, caused by small ACK 
packets being queued at a congested gateway with no in- 
terleaving larger data packets. This ACK-compression 
increases the burstiness of the traffic, causing substan- 
tial problems for networks with Random Drop (or Drop 
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Figure 3; RED gateways, two-way traffic, FIFO service. Figure 5: Random-Drop gateways, two-way traffic, 

FIFO service. 




Figure 4: RED gateways, two-way traffic, Priority ser- 
vice. 

Simulations with two-way traffic and RED gateways, 
with Constant-Rate increases and Reduce-by-Half de-
creases.

Figure 6: Random-Drop gateways, two-way traffic, Pri-
ority service.

Simulations with two-way traffic and Random-Drop
gateways, with Constant-Rate increases and Reduce-
by-Half decreases.

Figure 7: Packet trains with one-way traffic.

Figure 8: Packet trains with two-way traffic, FIFO service.

Figure 9: Packet trains with two-way traffic, Priority service.

Figures 3 and 5 show the results of simulations with
the network in Figure 2. The simulations in Figure 3 
use RED gateways. These differ from the simulations 
in Figure 6 in Part 1 only with the addition of traffic 
in the reverse direction. For the simulations in Figure 
3, connections 0 and 1 both receive somewhat reduced 
throughput, due to the effects of the two-way traffic, but 
the throughput for connection 0 is still substantial. For 
the simulations in this section, both the RED and the
Random-Drop gateways measure their queues in pack-
ets rather than in bytes. Because of this, the addition
of two-way traffic adds significantly to the congestion
in the network. If our gateways measured queue-length
in bytes rather than in packets, this would reduce the 
impact of two-way traffic. 
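
The difference between the two queue measures can be illustrated with a short calculation. The sketch below is not taken from the simulator; the packet sizes (1000-byte data packets, 40-byte ACKs) and the function name are assumptions chosen only for illustration.

```python
def ack_share_of_queue(n_data, n_ack, data_bytes=1000, ack_bytes=40):
    """Return the share of a queue taken up by ACKs, measured two ways.

    data_bytes and ack_bytes are hypothetical sizes, not values from the paper.
    """
    # Share when the gateway counts every packet as one unit.
    by_packets = n_ack / (n_data + n_ack)
    # Share when the gateway counts queue length in bytes.
    total_bytes = n_data * data_bytes + n_ack * ack_bytes
    by_bytes = (n_ack * ack_bytes) / total_bytes
    return by_packets, by_bytes


if __name__ == "__main__":
    by_packets, by_bytes = ack_share_of_queue(n_data=10, n_ack=10)
    print(f"ACK share by packets: {by_packets:.2f}, by bytes: {by_bytes:.3f}")
```

Under these assumed sizes, reverse-direction ACKs make up half of a packet-counted queue but only a few percent of a byte-counted one, which is consistent with the remark above that measuring in bytes would reduce the impact of two-way traffic.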

The simulations in Figure 5 use Random Drop gate- 
ways, and differ from the simulations in Figure 6 in Part
1 only with the addition of traffic in the reverse direc-
tion. For the simulations in Figure 5, the throughput 
for connection 0 is significantly reduced by the addition 
of two-way traffic, even for a small number of con- 
gested gateways. Because of the compressed ACKs for 
connection 0, the pattern of data packets transmitted 
by connection 0 is fairly bursty. With Random Drop or 
Drop Tail gateways, this burstiness increases connection
0's chances of having packets dropped at the gateways.
With RED gateways, which are designed to accommo-
date bursty traffic as much as possible, this burstiness
has less of an effect on network behavior. 

3.1 Packet trains with two-way traffic 

Definitions: packet train. For the network in Figure 1, 
the bottleneck service time is b = 5.33 ms., and it takes
a source node s = 0.8 ms. to transmit a data packet
to the gateway. With one-way traffic, ACK packets
arrive at the source node at least b ms. apart, and the
source node therefore sends data packets at least b ms.
apart, in the absence of window increases. However,
the source node is capable of sending data packets at
s-ms. intervals. This can occur as a result of a window
increase, or as a result of compressed ACKs that arrived
at the source node close together. For the purposes of 
this paper, we define a packet train as a sequence of 
packets that leave the source node at s-ms. intervals, 
faster than the bottleneck service rate. □ 
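
The definition above can be made concrete with a short sketch that groups a list of departure times into packet trains. This is an illustration only, not the code used for the simulations; the function name, the tolerance parameter, and the sample timestamps are assumptions.

```python
def packet_train_lengths(send_times_ms, s_ms=0.8, tol_ms=1e-6):
    """Split departure times (in ms, ascending) into packet trains.

    Consecutive packets belong to the same train when they leave the
    source no more than s_ms apart (plus a small floating-point tolerance).
    Returns the list of train lengths in order of occurrence.
    """
    lengths = []
    current = 0
    prev = None
    for t in send_times_ms:
        if prev is not None and (t - prev) <= s_ms + tol_ms:
            current += 1          # back-to-back with the previous packet
        else:
            if current:
                lengths.append(current)
            current = 1           # this packet starts a new train
        prev = t
    if current:
        lengths.append(current)
    return lengths


if __name__ == "__main__":
    # Hypothetical departures: one lone packet, a pair, a triple, a lone packet.
    times = [0.0, 5.33, 6.13, 10.66, 11.46, 12.26, 20.0]
    print(packet_train_lengths(times))
```

With s = 0.8 ms as in the text, packets spaced b = 5.33 ms apart each form a train of length one, while packets released back-to-back by a window increase or by compressed ACKs are counted as a single longer train.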

Figure 7 shows the packet train lengths for connection
0 for the first 50 seconds of a simulation similar to Figure
6 in Part 1, with one-way traffic and 10 congested gate-
ways. There are many packet trains of length 1, with
single packets, and also many packet trains of length 2, 
which occur each time connection 0 increases its win- 
dow by one packet. There is also one packet train of 
length 3, and one packet train of length 8. These result
from the window opening up after the Reduce-by-Half
window decrease algorithm.

With the Reduce-by-Half window decrease algo- 
rithm, a connection effectively waits half a round trip 
time after retransmitting a dropped packet, and reduces 
its window to half of its previous value. If only one 
packet has been dropped, then the data packets that are
transmitted are all clocked by incoming ACK pack-
ets. However, due to details of the Reduce-by-Half
algorithm, if many packets have been dropped in one
roundtrip time, then the Reduce-by-Half window de-
crease algorithm might open its window to half of its
previous value all at once when the ACK packet for the 
retransmitted packet is received. This results in a burst 
of packets sent by the source node. Connection 0 in 
Figure 2 has a large maximum window, and therefore is 
likely to have many packet drops during the initial slow- 
start phase of doubling the window each roundtrip time. 
This accounts for the long packet train in Figure 7 (and 
Figure 9). 

Figure 8 shows the packet train lengths for connection
0 for the first 50 seconds of the simulation as in Figure
3, with two-way traffic and 10 congested forward gate-
ways. There are many packet trains with 10 to 20
packets, and a few longer packet trains. The lefthand
chart of Figure 8 shows the number of packet trains of
length m, and the righthand chart shows the number of 
packets in packet trains of length m. As the righthand 
chart shows, only a small fraction of the packets are in 
packet trains of length one or two. Almost all of the 
packet trains of length greater than two are caused by
compressed ACKs. 
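
The two charts described above tally the same data in two ways: trains per length, and packets per length. A minimal sketch of that tally follows; the function names and the sample list of train lengths are hypothetical and are not taken from the simulation output.

```python
from collections import Counter


def train_histograms(train_lengths):
    """Return (trains_by_length, packets_by_length) for a list of train lengths.

    trains_by_length[m]  = number of packet trains of length m
    packets_by_length[m] = number of packets carried in trains of length m
    """
    trains_by_length = dict(Counter(train_lengths))
    packets_by_length = {m: m * count for m, count in trains_by_length.items()}
    return trains_by_length, packets_by_length


def fraction_in_short_trains(train_lengths, max_len=2):
    """Fraction of all packets that travel in trains of length <= max_len."""
    total = sum(train_lengths)
    short = sum(m for m in train_lengths if m <= max_len)
    return short / total if total else 0.0


if __name__ == "__main__":
    lengths = [1, 1, 2, 12, 15]  # hypothetical train lengths
    print(train_histograms(lengths))
    print(fraction_in_short_trains(lengths))
```

Weighting each train by its length shows why a few long trains can account for most of the packets even when short trains are the most numerous.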

Two ACK packets are compressed when the first 
ACK arrives at a gateway to a queue, and the sec- 
ond ACK packet arrives before the first ACK has been 
transmitted and also before any other intervening pack- 
ets. Once two ACK packets have been compressed, it 
is fairly difficult for them to become separated again; 
another packet has to arrive at some succeeding gate-
way in the short time between the arrival of the first
ACK packet and the arrival of the second ACK packet. 
Therefore, as the number of gateways with occasional 
queues increases, the problem of compressed ACKs also 
increases. 
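
The compression mechanism can be reproduced with a toy single-server FIFO queue. The sketch below is only an illustration under assumed numbers: the 0.2 ms ACK service time, the arrival times, and the function name are hypothetical, and only the 5.33 ms data service time comes from the text.

```python
def fifo_departures(arrivals):
    """Departure times from a single FIFO server.

    arrivals: list of (arrival_time_ms, service_time_ms), sorted by arrival.
    Each packet starts service when it has arrived and the server is free.
    """
    free_at = 0.0
    departures = []
    for arrival, service in arrivals:
        start = max(arrival, free_at)
        free_at = start + service
        departures.append(free_at)
    return departures


if __name__ == "__main__":
    # Three queued data packets, then two ACKs that arrive 5.33 ms apart
    # with no other packet arriving between them.
    congested = [(0.0, 5.33), (0.0, 5.33), (0.0, 5.33), (1.0, 0.2), (6.33, 0.2)]
    out = fifo_departures(congested)
    print("ACK spacing leaving congested queue:", out[4] - out[3])

    # The same two ACKs at an idle gateway keep their original spacing.
    idle = [(1.0, 0.2), (6.33, 0.2)]
    out = fifo_departures(idle)
    print("ACK spacing leaving idle queue:", out[1] - out[0])
```

In the congested case both ACKs wait behind the same backlog, so they leave separated only by the ACK service time; a later idle gateway preserves that small gap, which is why compressed ACKs tend to stay compressed.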

3.2 Priority gateways 

One way to demonstrate the effects of compressed 
ACKs is to run simulations with gateways that give pri- 
ority to small packets, and to compare these results to 
those from simulations with gateways with FIFO ser- 
vice. 

Definitions: Priority gateways. For gateways with 
Priority service, we defined small packets as being at
most 50 bytes, and we defined large packets as larger than
50 bytes. In our simulations, all data packets are defined
as large, and all ACK packets are defined as small. The
Priority gateway transmits packets using FIFO service,
with one exception. After a Priority gateway sends a
large n-byte packet, then if there are any small pack-
ets, the Priority gateway sends at most n bytes of small
packets before proceeding with the next large packet
in the queue. The assumption is that after the current
large packet is served, a Priority gateway will generally
send all outstanding small packets before sending the 
next large packet. Thus, small packets that arrive at 
the gateway queue when the server is busy will usually 
be transmitted after the gateway finishes transmitting 
the current large packet. Large packets will have their 
waiting time in the queue increased by at most a factor 
of two. Priority gateways are easily implemented, and 
don't require separate state for each connection. For our 
simulator, either Priority or FIFO service can be chosen 
with either Drop Tail, Random Drop, or RED gateways. 
□ 
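
The service rule in this definition can be sketched as follows. This is one possible reading of the rule, not the simulator's implementation: it works on a static snapshot of the queue (no packets arrive during service), keeps small packets in FIFO order among themselves, and uses hypothetical names throughout.

```python
from collections import deque

SMALL_MAX_BYTES = 50  # packets of at most 50 bytes are "small", per the definition


def priority_order(queue_sizes):
    """Return the indices of queued packets in transmission order.

    queue_sizes: packet sizes in bytes, in arrival (FIFO) order.
    After each large n-byte packet, up to n bytes of waiting small
    packets are sent before the next large packet.
    """
    pending = deque(enumerate(queue_sizes))
    order = []
    while pending:
        idx, size = pending.popleft()
        order.append(idx)
        if size <= SMALL_MAX_BYTES:
            continue  # a small packet at the head is served in plain FIFO order
        budget = size          # at most n bytes of small packets may follow
        exhausted = False      # set once a small packet no longer fits
        kept = deque()
        for j, sz in pending:
            is_small = sz <= SMALL_MAX_BYTES
            if is_small and not exhausted and sz <= budget:
                order.append(j)
                budget -= sz
            else:
                if is_small:
                    exhausted = True  # keep later small packets behind this one
                kept.append((j, sz))
        pending = kept
    return order


if __name__ == "__main__":
    # Hypothetical queue: data, data, ACK, data, ACK (sizes in bytes).
    print(priority_order([1000, 1000, 40, 1000, 40]))
```

Because at most n bytes of small packets are inserted after each n-byte large packet, the extra delay seen by a waiting large packet is bounded by the service time of the large packets ahead of it, which matches the factor-of-two bound stated in the definition.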

Compressed ACKs are much less likely to occur with
Priority service than with FIFO service. With Priority
service, each time an ACK packet arrives at a gateway 
to a busy server, the ACK packet waits at least until the 
current packet has been served before it can be trans- 
mitted. Thus, each Priority gateway adds a certain jitter 
to the relative timing of the ACK packets. This is quite 
different from the behavior of FIFO service, where each
ACK packet might wait a substantial time in a queue. 

Figure 4 shows the results of simulations with RED 
gateways with Priority service. Connection 0's through- 
put is somewhat better in Figure 4 than it was in Figure 
3. Thus, we can see that the compressed ACKs in Figure
3 did have a (small) effect on throughput in the network. 
Figure 9 shows the lengths of packet trains for connec- 
tion 0 for a 50-second simulation as in Figure 4. With 
Priority service and two-way traffic, almost all packet 
trains are of length one or two. The one packet train of 
length 35 is the result of the Reduce-by-Half window
decrease algorithm, as explained above. 

Figure 6 shows the result of simulations with Random 
Drop gateways with Priority service. Because the use 
of Priority service decreases the burstiness of the traf-
fic, the use of Priority service increases the throughput
for connection 0 significantly, with Random-Drop gate-
ways. Because the gateways in these simulations use
queues measured in packets rather than in bytes, the ad-
dition of two-way traffic still significantly increases the
congestion in the network, even with Priority service.
This accounts for the difference in performance between
Figure 6 from Part 1 with one-way traffic, and Figure 6
with two-way traffic with Priority gateways.

The main conclusions of this section are that with
Random Drop (or Drop Tail) gateways, the introduction
of two-way traffic can lead to a significant increase in
burstiness for traffic with multiple congested gateways,
and this increase in burstiness can result in a decrease
in throughput. For gateways such as RED gateways 
that avoid a bias against bursty traffic, the effects of 
an introduction of two-way traffic are much more mod- 
erate. Because Priority gateways allow us to look at 
networks with two-way traffic but without compressed 
ACKs, Priority gateways are a useful tool for exploring 
the effects of compressed ACKs in networks. 

We do not necessarily propose implementing Prior- 
ity gateways in current networks. For a network with 
Priority gateways, extra care would have to be given to 
the problems caused by gateways transmitting packets 
from one connection out-of-order. Packet-switched net- 
works do not guarantee in-order delivery of packets in 
any case, but the problems of out-of-order packets could 
be intensified by the use of Priority gateways. This 
could occur, for example, when some data packets from 
a connection are small, and other data packets from the
same connection are large. (This does not occur in our
simulations, because in our simulations all data packets
are defined as large, and all ACK packets are defined as
small.) At the moment, we are using Priority gateways 
simply to examine the effects of compressed ACKs at 
the gateway. 

4 Conclusions and Future Work 

For our simulations with multiple congested gateways 
and one-way traffic, the performance of Random Drop
gateways and of RED gateways is fairly similar. For
our simulations with two-way traffic, however, the traf- 
fic for the connection with multiple congested gateways 
is quite bursty, due to compressed ACKs. In this case, 
there is a significant difference in performance between 
Random Drop and RED gateways. Because of their 
bias against bursty traffic (a bias shared by Drop Tail 
gateways), the use of Random Drop gateways results 
in reduced throughput for the connection with multi- 
ple congested gateways. For RED gateways, which are 
designed to accommodate bursty traffic, the introduction
of two-way traffic, and the corresponding increase of
burstiness, does not result in a significant reduction in
throughput for the connection with multiple congested 
gateways. 

In order to explore the effect of compressed ACKs in 
simulations with two-way traffic, we have introduced 
gateways with Priority service, which give priority to 
smaller packets such as ACK packets. With the tra- 
ditional FIFO service, ACK packets arriving at a con- 
gested gateway wait their turn in the queue, and there-
fore it is possible for two successive ACK packets to be
"compressed" in the queue. With Priority service, af- 
ter each data packet is transmitted, (almost all) waiting 
ACK packets are transmitted before the next data packet 
in the queue. In this way, with Priority service com- 
pressed ACKs are avoided. We show that with Random 
Drop gateways the throughput for a connection with 
multiple congested gateways is acceptable in simula- 
tions with two-way traffic with Priority service (and no 
compressed ACKs), but that the throughput is not ac- 
ceptable in simulations with two-way traffic with FIFO 
service in the gateway (with compressed ACKs). It is 
not the two-way traffic itself that degrades performance 
for the connection with multiple congested gateways, 
but specifically the compressed ACKs, with the result- 
ing increased burstiness of the traffic.